
SLIDE 1

Marc Mehlman

Statistics

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

“To understand God’s thoughts, we must study statistics, for these are the measure of his purpose.” – Florence Nightingale
“Statistics: the mathematical theory of ignorance.” – Morris Kline
“Statistics means never having to say you’re certain.” – Anonymous

Marc Mehlman (University of New Haven) Statistics 1 / 48

SLIDE 2

Table of Contents

Introduction
Graphical Representation of Distributions
Measuring the Center
Measuring the Spread
Normal Distribution
Misuse of Statistics
Chapter #1 R Assignment

SLIDE 4

Introduction

Definition
Given a population, one often examines a sample of the population in order to draw inferences about the entire population.
A variable is a measurable characteristic of individuals within the population.
The distribution of a variable describes which values the variable takes and how frequently it takes them.
Data are a variable’s values from the sample.
Statistics is the science of drawing inferences about the population from data.

Example
From the 50,000 residents of the town of Milford, 300 were selected randomly and asked what their highest academic degree was. The population is the 50,000 residents, the sample is the 300 randomly selected residents, and the variable is the level of education of the resident. It was too costly to contact all 50,000 residents, so the actual distribution of terminal degrees among the entire population is inferred from the distribution of the terminal degrees of the 300 randomly sampled residents.

Statistics’ origins: anecdotes and noticing patterns in random happenings.
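The Milford example can be imitated in R. The following is only a sketch under stated assumptions: the degree categories and their population proportions are invented for illustration, since the slides give no actual figures. It draws a simple random sample of 300 from a simulated population of 50,000 and compares the two distributions.

```r
# Hypothetical population: highest degree of 50,000 residents.
# The categories and proportions below are made up for illustration.
set.seed(1)  # only so that the sketch is reproducible
degrees <- c("None", "High school", "Bachelor", "Graduate")
population <- sample(degrees, size = 50000, replace = TRUE,
                     prob = c(0.10, 0.45, 0.30, 0.15))

# Simple random sample of 300 residents, drawn without replacement.
srs <- sample(population, size = 300)

# Distribution of the variable in the sample and in the population.
sample_dist <- prop.table(table(srs))
population_dist <- prop.table(table(population))
print(round(sample_dist, 3))
print(round(population_dist, 3))
```

In practice only the sample distribution is observed; the population distribution is what one tries to infer.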

SLIDE 5

“Data. Data. Data. I can’t make bricks without clay.” – Sherlock Holmes
“In God we trust. All others must bring data.” – W. Edwards Deming

Definition (Types of Variables)
qualitative (categorical): descriptive. Examples: color of eyes, gender, city born in.
quantitative: numeric. Examples: height, miles per gallon, temperature, etc.

Definition (Types of Quantitative Variables)
discrete: discrete range. Examples: number of children someone has, number of coins in pocket.
continuous: continuous range. Examples: weight, speed.
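As a rough illustration of how these variable types are usually represented in R, the sketch below builds a tiny data frame. The three records and the column names are hypothetical, invented only for this example.

```r
# Hypothetical records for three people (values invented for illustration).
people <- data.frame(
  eye_color = factor(c("brown", "blue", "brown")),  # qualitative (categorical)
  children  = c(2L, 0L, 3L),                        # quantitative, discrete
  weight_kg = c(70.5, 82.1, 64.3)                   # quantitative, continuous
)

str(people)                      # shows the storage type of each variable
print(table(people$eye_color))   # counting is the natural summary for categories
print(mean(people$weight_kg))    # averaging is meaningful for numeric values
```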

SLIDE 6

Graphical Representation of Distributions

SLIDE 7

Distribution of a Variable

To examine a single variable, we graphically display its distribution.
The distribution of a variable tells us what values it takes and how often it takes these values.
Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.

Categorical variable: pie chart, bar graph
Quantitative variable: histogram, stemplot

SLIDE 8

Categorical Variables

The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.
Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.
Bar graphs represent each category as a bar whose height shows the category count or percent.

SLIDE 9

> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales, labels = lbls, main="Pie Sales")

[Figure: pie chart titled "Pie Sales" with slices Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream]

SLIDE 10

> counts=c(40,30,20,10)
> colors=c("Red","Blue","Green","Brown")
> barplot(counts,names.arg=colors,main="Favorite Colors")

[Figure: bar graph titled "Favorite Colors" with bars Red, Blue, Green, Brown; vertical axis from 0 to 40]

SLIDE 11

Quantitative Variables

The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.

SLIDE 12

Histograms

Histograms are for quantitative variables that take many values and/or for large datasets.
Divide the possible values into classes (equal widths).
Count how many observations fall into each interval (the counts may be changed to percents).
Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each interval.
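The histogram steps can be traced by hand in R before any picture is drawn. This sketch uses the built-in trees dataset; the class boundaries 8, 10, ..., 22 are an assumed choice of equal-width classes, not something prescribed by the slides.

```r
girth <- trees$Girth                    # tree diameters in inches (built-in dataset)

# Step 1: divide the possible values into equal-width classes.
breaks <- seq(8, 22, by = 2)            # assumed class boundaries

# Step 2: count how many observations fall into each class (a, b].
classes <- cut(girth, breaks = breaks)
counts <- table(classes)
percents <- 100 * counts / length(girth)
print(counts)
print(round(percents, 1))

# Step 3 (drawing the bars) is what hist() does; here we only fetch its counts.
h <- hist(girth, breaks = breaks, plot = FALSE)
print(h$counts)
```

The counts from cut() and table() should agree with those computed by hist() for the same breaks.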

SLIDE 13

> hist(trees$Girth,main="Girth of Black Cherry Trees",xlab="Diameter in Inches")

[Figure: histogram titled "Girth of Black Cherry Trees"; horizontal axis "Diameter in Inches" from 8 to 22, vertical axis "Frequency"]

SLIDE 14

Stemplots

To construct a stemplot:
Separate each observation into a stem (the first part of the number) and a leaf (the remaining part of the number).
Write the stems in a vertical column; draw a vertical line to the right of the stems.
Write each leaf in the row to the right of its stem; order the leaves if desired.

SLIDE 15

> Girth=trees$Girth
> stem(Girth) # stem and leaf plot

  The decimal point is at the |

   8 | 368
  10 | 57800123447
  12 | 099378
  14 | 025
  16 | 03359
  18 | 00
  20 | 6

> stem(Girth, scale=2)

  The decimal point is at the |

   8 | 368
   9 |
  10 | 578
  11 | 00123447
  12 | 099
  13 | 378
  14 | 025
  15 |
  16 | 03
  17 | 359
  18 | 00
  19 |
  20 | 6

SLIDE 16

Examining Distributions

In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern by its shape, center, and spread.
An important kind of deviation is an outlier, an individual that falls outside the overall pattern.

SLIDE 17

Examining Distributions

A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

[Figure: three example distributions labeled Symmetric, Skewed-left, Skewed-right]

SLIDE 18

Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: distribution by state, with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population. A large gap in the distribution is typically a sign of an outlier.

SLIDE 19

Measuring the Center

SLIDE 20

Measures of the Center

Definition
Given \(x_1, x_2, \ldots, x_n\), the sample mean is
\[ \bar{x} \overset{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{j=1}^{n} x_j. \]
The population mean is
\[ \mu \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} x_j. \]
If one orders the data from smallest to largest, the median is
\[ M \overset{\text{def}}{=} \begin{cases} \text{the middle value of the data} & \text{if } n \text{ is odd,} \\ \text{the average of the middle two values of the data} & \text{if } n \text{ is even.} \end{cases} \]
Laymen refer to the mean as the average.

Example
The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost of a house in Milford? Which is the better measure of the cost of buying a house in Milford, the mean or the median?
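One way to explore the Bill Gates question numerically is sketched below. The five sale prices are hypothetical, chosen only so that their median equals the $212,175 quoted on the slide; the real Milford sales data are not given.

```r
# Hypothetical sale prices in dollars (invented; only the median matches the slide).
prices <- c(150000, 185000, 212175, 240000, 310000)
gates  <- 100e6   # the hypothetical $100 million purchase

before <- c(mean = mean(prices), median = median(prices))
after  <- c(mean = mean(c(prices, gates)), median = median(c(prices, gates)))
print(before)
print(after)
```

With these made-up numbers the mean jumps by orders of magnitude while the median barely moves, which suggests why the median is usually the more informative summary for house prices.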

SLIDE 21

“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous
“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
“The average human has one breast and one testicle.” – humorist Des McHale

SLIDE 22

Comparing Mean and Median

The mean and median measure center in different ways, and both are useful.
The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
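A small numerical check of these statements, using two invented data sets (one exactly symmetric, one with a long right tail), might look like this:

```r
# Both data sets are hypothetical, chosen to make the arithmetic easy to follow.
symmetric    <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)     # mirror image about 3
right_skewed <- c(1, 1, 2, 2, 3, 4, 6, 9, 17)    # long tail to the right

print(c(mean = mean(symmetric), median = median(symmetric)))
print(c(mean = mean(right_skewed), median = median(right_skewed)))
```

For the symmetric set the mean and median coincide at 3; for the right-skewed set the median is still 3 but the mean is pulled out to 5, toward the long tail.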

SLIDE 23

Measuring the Spread

SLIDE 24

Measuring Spread: The Quartiles

A measure of center alone can be misleading.
A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
Arrange the observations in increasing order and locate the median M.
The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 − Q1.
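The quartile recipe above can be written as a short R function. This is a sketch: the function name textbook_quartiles and the data vector are hypothetical, and the function assumes at least two observations. Note that R's own quantile() and IQR() use a different interpolation rule by default, so their results may differ slightly from this hand method.

```r
# Quartiles by the "median of each half" method described on the slide.
# Assumes length(x) >= 2; the name textbook_quartiles is made up for this sketch.
textbook_quartiles <- function(x) {
  s <- sort(x)
  n <- length(s)
  lower <- s[seq_len(floor(n / 2))]       # observations to the left of the median
  upper <- s[(ceiling(n / 2) + 1):n]      # observations to the right of the median
  c(Q1 = median(lower), M = median(s), Q3 = median(upper))
}

x <- c(2, 4, 4, 5, 7, 8, 10, 12)          # hypothetical data
q <- textbook_quartiles(x)
iqr <- unname(q["Q3"] - q["Q1"])
print(q)
print(iqr)
```

For this vector the lower half is 2, 4, 4, 5 and the upper half is 7, 8, 10, 12, giving Q1 = 4, M = 6, Q3 = 9 and IQR = 5.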

SLIDE 25

Definition (The 1.5 × IQR Rule for Outliers)
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Example
The numbers of items of mail that eleven professors, chosen at random, received on September 3rd, 2013 are given below.
18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.
Are there any outlier professors?

Solution: We figure out the quartiles from the ordered data:
3 5 7 9 11 13 15 16 18 23 35
so Q1 = 7, M = 13 and Q3 = 18. Since IQR = 18 − 7 = 11 and 35 − Q3 = 17 > 1.5 × IQR = 16.5, one identifies 35 as an outlier. No observation lies more than 16.5 below Q1, so 35 is the only outlier.
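The mail example can be re-checked in R. The sketch below computes the quartiles by the median-of-each-half method used in the solution rather than with quantile(), whose default rule would give slightly different quartiles for these data; the variable names are chosen only for this illustration.

```r
mail <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)
s <- sort(mail)
n <- length(s)

# Quartiles as medians of the lower and upper halves (median itself excluded).
q1 <- median(s[seq_len(floor(n / 2))])
q3 <- median(s[(ceiling(n / 2) + 1):n])
iqr <- q3 - q1

# 1.5 x IQR rule: anything beyond these fences is called an outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- mail[mail < lower_fence | mail > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```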

SLIDE 26

Definition
Given \(x_1, x_2, \ldots, x_n\), the five number summary is
minimum, Q1, M, Q3, maximum.

Definition
Given \(x_1, x_2, \ldots, x_n\), to create a boxplot (also called a box and whiskers plot):
1. Draw and label a vertical number line that includes the range of the distribution.
2. Draw a box from height Q1 to Q3.
3. Draw a horizontal line inside the box at the height of the median.
4. Draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.
5. Sometimes outliers are identified with ◦’s (R does this).

Boxplots are often useful when comparing the values of two different variables.

SLIDE 27

> boxplot(trees$Height, main="Heights of Black Cherry Trees")
> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,
+ main="Lawyers' Demeanor/Diligence ratings of US Superior Court state judges")

[Figure: boxplot titled "Heights of Black Cherry Trees"; vertical axis from 65 to 85]
[Figure: side-by-side boxplots titled "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges"; vertical axis from 5 to 9]

SLIDE 28

Definition
The population variance is
\[ \sigma^2 \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2 \]
and the population standard deviation is
\[ \sigma \overset{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2}. \]

However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, \(\mu\), so the best one can do is use the sample mean, \(\bar{x}\), instead.

Definition
The sample variance is
\[ s^2 \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2 \]
and the sample standard deviation is
\[ s \overset{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2}. \]
Notice the use of \(n - 1\) instead of \(n\) for the sample variance and standard deviation.

SLIDE 29

Properties of the Sample Standard Deviation

1. s measures the amount the data is dispersed about the mean.
2. s ≥ 0, and if s = 0 then all the data values are the same.
3. s has the same units of measurement as the data.
4. s is sensitive to the existence of outliers.

Example
Suppose our random sample is 4.5, 3.7, 2.8, 5.3, 4.6. Then
\[ \bar{x} = \frac{1}{5} \left[ 4.5 + 3.7 + 2.8 + 5.3 + 4.6 \right] = 4.18 \]
\[ s^2 = \frac{1}{5-1} \left[ (4.5 - 4.18)^2 + (3.7 - 4.18)^2 + (2.8 - 4.18)^2 + (5.3 - 4.18)^2 + (4.6 - 4.18)^2 \right] = 0.917 \]
\[ s = \sqrt{0.917} = 0.9576012. \]
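The hand computation can be reproduced in R as a check. This sketch spells out the n − 1 formula and then compares it with R's built-in var() and sd(), which use the same divisor; the names s2_manual and s_manual are chosen for this illustration.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)
n <- length(x)
xbar <- mean(x)

# Sample variance and standard deviation, written out with the n - 1 divisor.
s2_manual <- sum((x - xbar)^2) / (n - 1)
s_manual  <- sqrt(s2_manual)
print(c(mean = xbar, variance = s2_manual, sd = s_manual))

# The built-in functions should agree with the manual formula.
print(all.equal(s2_manual, var(x)))
print(all.equal(s_manual, sd(x)))
```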

SLIDE 30

Choosing Measures of Center and Spread

We now have a choice between two descriptions for center and spread:
Mean and Standard Deviation
Median and Interquartile Range

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

SLIDE 31

R Commands: Example

> mean(trees$Volume)
[1] 30.17097
> median(trees$Volume)
[1] 24.2
> summary(trees$Volume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.20   19.40   24.20   30.17   37.30   77.00
> IQR(trees$Volume)
[1] 17.9
> var(trees$Volume)
[1] 270.2028
> sd(trees$Volume)
[1] 16.43785

SLIDE 32

Normal Distribution

SLIDE 33

Density Curves

A density curve is a curve that:
is always on or above the horizontal axis
has an area of exactly 1 underneath it

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
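The two defining properties can be checked numerically for a concrete curve. The sketch below uses the standard Normal density dnorm() purely as an example of a density curve, and the range from −1 to 2 is an arbitrary choice for illustration.

```r
# Example density curve: the standard Normal density.
f <- function(x) dnorm(x, mean = 0, sd = 1)

# Property 1: the curve is on or above the horizontal axis (checked on a grid).
grid <- seq(-5, 5, by = 0.1)
print(all(f(grid) >= 0))

# Property 2: the total area under the curve is 1 (numerical integration).
total_area <- integrate(f, -Inf, Inf)$value
print(total_area)

# Area above a range = proportion of observations falling in that range.
area_range <- integrate(f, -1, 2)$value
print(area_range)
print(pnorm(2) - pnorm(-1))   # the same proportion from the Normal CDF
```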

SLIDE 34

Density Curves

Our measures of center and spread apply to density curves as well as to actual sets of observations.

Distinguishing the Median and Mean of a Density Curve
The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.
The mean of a density curve is the balance point, at which the curve would balance if made of solid material.
The median and the mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.

SLIDE 35

Normal Distributions

One particularly important class of density curves is the Normal curves, which describe Normal distributions.
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean µ and standard deviation σ.

SLIDE 36

Normal Distributions

A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.
The mean of a Normal distribution is the center of the symmetric Normal curve.
The standard deviation is the distance from the center to the change-of-curvature points on either side.
We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).

SLIDE 37

The 68-95-99.7 Rule

In the Normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of µ.
Approximately 95% of the observations fall within 2σ of µ.
Approximately 99.7% of the observations fall within 3σ of µ.
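The three percentages can be recovered from the Normal cumulative distribution function pnorm(). In this sketch the values mu = 70 and sigma = 2.8 are illustrative only; any mean and positive standard deviation should give the same proportions.

```r
# Proportion of a standard Normal distribution within k standard deviations of the mean.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
names(within_k_sd) <- paste0("within ", k, " sd")
print(round(100 * within_k_sd, 2))   # in percent

# The same proportions for another Normal distribution (illustrative mu and sigma).
mu <- 70
sigma <- 2.8
same <- pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)
print(all.equal(unname(within_k_sd), same))
```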

SLIDE 38

Standardizing Observations

All Normal distributions are the same if we measure in units of size σ from the mean µ as center.

If a variable x has a distribution with mean µ and standard deviation σ, then the standardized value of x, or its z-score, is
\[ z = \frac{x - \mu}{\sigma}. \]

The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. That is, the standard Normal distribution is N(0,1).
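A brief sketch of standardizing in R follows. The numbers mu = 100, sigma = 15 and x = 130 are hypothetical values picked for easy arithmetic; the point is that a probability for N(µ,σ) equals the corresponding probability for N(0,1) evaluated at the z-score.

```r
mu <- 100      # hypothetical mean
sigma <- 15    # hypothetical standard deviation
x <- 130       # hypothetical observation

z <- (x - mu) / sigma          # standardized value (z-score)
print(z)

p_original <- pnorm(x, mean = mu, sd = sigma)   # P(X <= x) under N(mu, sigma)
p_standard <- pnorm(z)                          # P(Z <= z) under N(0, 1)
print(c(p_original, p_standard))
```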

SLIDE 39

The Standard Normal Table

Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.

Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.

SLIDE 40

Example
Given X ∼ N(5, 3), what is the probability that 4 ≤ X ≤ 7?

Solution: Using the table:
\[ P(4 \le X \le 7) = P\left( \frac{4 - 5}{3} \le \frac{X - 5}{3} \le \frac{7 - 5}{3} \right) = P(-0.33 \le Z \le 0.67) \]
\[ = P(Z \le 0.67) - P(Z \le -0.33) = 0.7486 - 0.3707 = 0.3779. \]
Or, using R:
> pnorm(7,5,3) - pnorm(4,5,3)
[1] 0.3780661

SLIDE 41

Example
According to the National Health and Nutrition Examination Study 1976–1980, the heights (in inches) of adult men aged 18–24 are N(70, 2.8). What is the tallest a man aged 18–24 can be and still be in the bottom 10% of heights of all such men?

Solution: Using the table:
\[ 0.1 = P(X \le x) = P\left( \frac{X - 70}{2.8} \le \frac{x - 70}{2.8} \right) = P\left( Z \le \frac{x - 70}{2.8} \right). \]
Using reverse table lookup one has
\[ -1.28 = \frac{x - 70}{2.8} \Rightarrow x = 66.416. \]
Or, using R:
> qnorm(0.1,70,2.8)
[1] 66.41166

SLIDE 42

Normal Quantile Plots

One way to assess if a distribution is indeed approximately normal is to plot the data on a normal quantile plot.
The data points are ranked and the percentile ranks are converted to z-scores with Table A. The z-scores are then used for the x axis against which the data are plotted on the y axis of the normal quantile plot.
If the distribution is indeed normal, the plot will show a straight line, indicating a good match between the data and a normal distribution.
Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
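The construction described above can be sketched by hand in R. One assumption to flag: the percentile ranks here come from R's ppoints() convention, which is one of several reasonable choices and may differ slightly from ranks computed for a table lookup. The sketch computes the coordinates only and compares them with those produced by qqnorm().

```r
y <- sort(trees$Girth)      # ranked data (y axis of the plot)
n <- length(y)

p <- ppoints(n)             # percentile ranks, using R's plotting-position convention
z <- qnorm(p)               # corresponding z-scores (x axis of the plot)

# qqnorm() can return its coordinates without drawing anything.
qq <- qqnorm(trees$Girth, plot.it = FALSE)
print(all.equal(sort(qq$x), z))

# A correlation near 1 between z and y indicates points close to a straight line.
print(cor(z, y))
```

Calling plot(z, y) would draw the same picture as qqnorm(trees$Girth).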

SLIDE 43

Normal Quantile Plots

Normal quantile plots are complex to do by hand, but they are standard features in most statistical software.

[Figure: normal quantile plot of rainwater pH values] Good fit to a straight line: the distribution of rainwater pH values is close to normal.
[Figure: normal quantile plot of survival times] Curved pattern: the data are not normally distributed. Instead, they show a right skew: a few individuals have particularly long survival times.

SLIDE 44

R commands:

> dat=rnorm(500,4,3)
> qqnorm(dat); qqline(dat, col="red")
> qqnorm(trees$Girth); qqline(trees$Girth, col="red")

[Figure: two plots titled "Normal Q-Q Plot", one for dat and one for trees$Girth; horizontal axis "Theoretical Quantiles", vertical axis "Sample Quantiles"]

SLIDE 45

Misuse of Statistics

SLIDE 46

“He uses statistics as a drunken man uses lamp-posts . . . for support rather than illumination.” – Andrew Lang (1844–1912)
“Figures fool when fools figure.” – Oliver Lancaster, professor of mathematical statistics at Sydney University
“There are three kinds of lies: lies, damned lies, and statistics.” – Mark Twain
“Facts are stubborn, but statistics are more pliable.” – Mark Twain
“Torture numbers, and they’ll confess to anything.” – Gregg Easterbrook

Daily Cal: Forty percent of the workforce is women, yet only thirty percent of women on TV series work.

SLIDE 47

Chapter #1 R Assignment

SLIDE 48

Fifty-eight sailors are sampled and their eye color is noted as below:

blue  brown  green  hazel  red
  11     32      8      5    2

1. Create a barplot and a pie chart of eye color from the sailor sample.
2. Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.
3. Find the mean, median, five number summary, variance and standard deviation from the sample of heights in the dataset “Loblolly”.
4. If X ∼ N(2, 3), find P(1.3 ≤ X ≤ 5.8).
5. Create a Normal Quantile Plot of the height of loblolly trees from the dataset “Loblolly” and decide if the heights came from a normal distribution.
