Business Statistics


Feb 28, 2024


SLIDE 1

SEVERAL $\mu$s: COMPARISON

Business Statistics

SLIDE 2

CONTENTS

▪ Comparing two $\mu$s
▪ Comparing more than two $\mu$s
▪ Analysis of variance
▪ Testing significance of ANOVA
▪ Performing ANOVA
▪ The statistical model
▪ Equal variances
▪ Old exam question
▪ Further study

SLIDE 3

COMPARING TWO $\mu$s

Recall the comparison of the means of two independent samples ($Y_1$ and $Y_2$) with the $t$-test:
▪ $H_0: \mu_1 = \mu_2$
▪ under $H_0$: $\dfrac{\bar{Y}_1 - \bar{Y}_2}{S_{\bar{Y}_1 - \bar{Y}_2}} \sim t_{\text{df}}$, where df is the number of degrees of freedom, depending on the assumption about the variances
▪ reject when $t_{\text{calc}} = \dfrac{\bar{y}_1 - \bar{y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} > t_{\text{crit}}$

Can we do this for three samples as well?
▪ $H_0: \mu_1 = \mu_2 = \mu_3$

Note: we earlier wrote $X_1$ and $X_2$, or $X$ and $Y$, so why not $Y_1$ and $Y_2$?

SLIDE 4

COMPARING MORE THAN TWO $\mu$s

A first attempt:
▪ think about how $H_0: \mu_1 = \mu_2$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2}{\sigma_{\text{diff}}} \sim t_{\text{df}}$
▪ try if $H_0: \mu_1 = \mu_2 = \mu_3$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2 - \bar{Y}_3}{\sigma_{\text{diff}}} \sim t_{\text{df}}$

Very wrong!
▪ every year a few students try this in their exam, but it is a dead end!

SLIDE 5

COMPARING MORE THAN TWO $\mu$s

A second attempt:
▪ pairwise comparisons: $Y_1$ vs. $Y_2$ and $Y_2$ vs. $Y_3$
  ▪ $Y_1$ vs. $Y_3$ not needed
▪ that implies doing two tests
  ▪ not one

For each test the probability of incorrectly rejecting $H_0$ is $\alpha$
▪ so with 2 (independent) tests, the probability of at least one incorrect rejection becomes $1 - (1 - \alpha)^2$
▪ example: $\alpha = 0.05$ gives 0.0975, so almost double

So that will not work
▪ think about testing 10 means (45 pairwise comparisons): the probability of at least one wrong decision becomes about 90%
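The arithmetic behind this argument can be checked in a few lines. The following is a minimal sketch that assumes the individual tests are independent, which is what the formula $1 - (1 - \alpha)^k$ relies on; the function name `familywise_error_rate` is made up for this illustration.

```python
from math import comb


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false rejection among n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05

# Two tests, as in the second attempt.
two_tests = familywise_error_rate(alpha, 2)

# All pairwise comparisons among 10 means: "10 choose 2" pairs.
n_pairs = comb(10, 2)
many_tests = familywise_error_rate(alpha, n_pairs)

print(f"2 tests: {two_tests:.4f}")
print(f"{n_pairs} tests: {many_tests:.4f}")
```

With 45 comparisons the result is close to 0.90, which is the "90%" quoted on the slide.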

SLIDE 6

COMPARING MORE THAN TWO $\mu$s

A third attempt:
▪ We can partition the sum of squares into two parts:
  ▪ between the three groups
  ▪ within each group
▪ Try to identify sources of variation in a numerical dependent variable $Y$ (the response variable)
▪ Variation in $Y$ about its mean is partly explained by a categorical independent variable (the factor, with different levels)
  ▪ and partly unexplained (random error)

SLIDE 7

COMPARING MORE THAN TWO $\mu$s

Example
▪ Chip defect rates are different in every batch, but are there systematic differences between manufacturers?
  ▪ numerical dependent variable: chip defect rate ($Y$)
  ▪ categorical independent variable (one factor with four levels): manufacturer (1-4)

SLIDE 8

COMPARING MORE THAN TWO $\mu$s

Statistical model (formulation 1):
▪ manufacturer 1: $Y_{i1} = \mu_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu_4 + \varepsilon_{i4}$

Where:
▪ $Y$ is the defect rate
▪ $\mu_1$ is the mean for manufacturer 1
▪ ...
▪ $\mu_4$ is the mean for manufacturer 4
▪ $\varepsilon$ is the random, unexplained, part

Null hypothesis:
▪ $\mu_1 = \mu_2 = \mu_3 = \mu_4$

SLIDE 9

COMPARING MORE THAN TWO $\mu$s

Statistical model (formulation 2):
▪ manufacturer 1: $Y_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu + \alpha_4 + \varepsilon_{i4}$

Where:
▪ $Y$ is the defect rate
▪ $\mu$ is the overall mean defect rate
▪ $\alpha_1$ is manufacturer 1’s mean deviation from $\mu$
▪ ...
▪ $\alpha_4$ is manufacturer 4’s mean deviation from $\mu$
▪ $\varepsilon$ is the random, unexplained, part

Null hypothesis:
▪ $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
▪ that is: $\mu + \alpha_1 = \dots = \mu + \alpha_4$

This “group effect” $\alpha$ has nothing to do with the significance level $\alpha$!
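The link between the two formulations can be made concrete with a small sketch. The group means, the error standard deviation and the seed below are hypothetical values chosen for illustration. The sketch also assumes equal group sizes, so that the overall mean is the plain average of the group means and the effects sum to zero (a common convention, not stated on the slides).

```python
import random

# Hypothetical mean defect rates (in %) for manufacturers 1-4; not from the slides.
group_means = [2.0, 2.5, 1.5, 3.0]

# Formulation 2: overall mean plus group effects.
# Assumption: equal group sizes, so the overall mean is the plain average
# of the group means and the effects sum to zero.
overall_mean = sum(group_means) / len(group_means)
effects = [m - overall_mean for m in group_means]

# Both formulations describe the same group means: mu_j = mu + alpha_j.
rebuilt_means = [overall_mean + a for a in effects]

# One simulated observation per manufacturer: Y_ij = mu + alpha_j + eps_ij,
# with eps_ij drawn from N(0, sigma^2); sigma is an assumed value.
rng = random.Random(1)
sigma = 0.4
observations = [overall_mean + a + rng.gauss(0, sigma) for a in effects]

print("overall mean:", overall_mean)
print("effects:", effects)
print("simulated observations:", [round(y, 2) for y in observations])
```

Under the null hypothesis all effects would be 0 and every manufacturer would share the overall mean.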

SLIDE 10

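COMPARING MORE THAN TWO $\mu$s

▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
▪ Formulation 1:
  ▪ wrong: $H_1: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
  ▪ correct: $H_1: \text{not } (\mu_1 = \mu_2 = \mu_3 = \mu_4)$
  ▪ or: at least one of the $\mu$s differs from the other $\mu$s

Both of the following situations contradict $H_0$:
▪ $\mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
▪ $\mu_1 = \mu_2 = \mu_3 \neq \mu_4$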

SLIDE 11

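COMPARING MORE THAN TWO $\mu$s

▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
▪ Formulation 2:
  ▪ wrong: $H_1: \alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4 \neq 0$
  ▪ correct: $H_1: \text{not } (\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0)$
  ▪ or: at least one of the $\alpha$s differs from 0

Both of the following situations contradict $H_0$:
▪ $\alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4$
▪ $\alpha_1 = \alpha_2 = \alpha_3 \neq \alpha_4$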

SLIDE 12

EXERCISE 1

We want to investigate possible differences in mean income in Atlanta, Boston, Chicago and Detroit.

a. What is the null hypothesis?
b. Suppose the null hypothesis is rejected. What can you conclude?

SLIDE 13

ANALYSIS OF VARIANCE

▪ Define notation:
  ▪ $y$ is the numerical value (e.g., chip defect rate)
  ▪ $y_{ij}$ is the value for observation #$i$ within treatment #$j$ (e.g., machine #$j$)
  ▪ $\bar{y}_{\cdot j}$ is the average over all observations ($i = 1, \dots, n_j$) within treatment #$j$
  ▪ $\bar{\bar{y}}_{\cdot\cdot}$ is the average over all observations within all treatments ($j = 1, \dots, c$)
▪ Analysis of variance (ANOVA) model: $Y_{ij} = \mu_j + \varepsilon_{ij}$ or $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$

Observe the position of the dots. A dot indicates that the index in that position has been averaged over.
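The dot notation translates directly into two averaging steps. The sketch below uses made-up observations with unequal group sizes to show that the grand mean averages over all observations, so it need not equal the plain average of the group means.

```python
# Hypothetical observations y_ij: one inner list per treatment j (unequal sizes n_j).
groups = [[2, 4], [3, 5, 7], [6]]

# ybar_.j: average over i (the observations) within treatment j.
group_means = [sum(g) / len(g) for g in groups]

# ybar_..: average over all observations in all treatments.
all_values = [y for g in groups for y in g]
grand_mean = sum(all_values) / len(all_values)

# For comparison only: the unweighted average of the group means.
mean_of_means = sum(group_means) / len(group_means)

print("group means:", group_means)
print("grand mean:", grand_mean)
print("average of group means:", round(mean_of_means, 3))
```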

SLIDE 14

ANALYSIS OF VARIANCE

Compare variation within groups to variation between groups
▪ variation within group #$j$:
  ▪ $SSW_j = \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\cdot j} \right)^2$
▪ variation within all groups $j = 1, \dots, c$:
  ▪ $SSW = \sum_{j=1}^{c} SSW_j = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\cdot j} \right)^2$
▪ variation between the $c$ groups
  ▪ so due to the $\alpha$s:
  ▪ $SSA = \sum_{j=1}^{c} n_j \left( \bar{y}_{\cdot j} - \bar{\bar{y}}_{\cdot\cdot} \right)^2$

So $SSA$ is the variation around the mean $\bar{\bar{y}}_{\cdot\cdot}$ that is explained by the model, by factor “A”

SLIDE 15

ANALYSIS OF VARIANCE

Together, $SSA$ and $SSW$ make up the total variation
▪ variation in entire data set:
  ▪ $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{\bar{y}}_{\cdot\cdot} \right)^2$
▪ so
  ▪ $SST = SSA + SSW$
▪ Think about the logic: we are comparing several means by comparing two variances
  ▪ analysis of variance is used to compare $\mu_1, \mu_2, \dots, \mu_c$

So $SST$ is the total variation around the grand mean $\bar{\bar{y}}_{\cdot\cdot}$

[Diagram labels: $SST$, $SSA$, $SSE$]
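The partition of the total sum of squares can be verified numerically. This is a minimal sketch on made-up data (three groups of five observations, not taken from the slides); the variable names `ssa`, `ssw` and `sst` stand for the between-groups, within-groups and total sums of squares.

```python
# Hypothetical data: one list of observations per group.
groups = [
    [10, 12, 11, 13, 14],
    [15, 14, 16, 17, 13],
    [11, 9, 10, 12, 8],
]


def mean(values):
    return sum(values) / len(values)


grand_mean = mean([y for g in groups for y in g])

# Within groups: squared deviations from each group's own mean.
ssw = sum((y - mean(g)) ** 2 for g in groups for y in g)

# Between groups: squared deviations of group means from the grand mean,
# weighted by group size n_j.
ssa = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Total: squared deviations of every observation from the grand mean.
sst = sum((y - grand_mean) ** 2 for g in groups for y in g)

print(f"SSA = {ssa:.2f}, SSW = {ssw:.2f}, SST = {sst:.2f}")
print("SSA + SSW equals SST:", abs(sst - (ssa + ssw)) < 1e-9)
```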

SLIDE 16

ANALYSIS OF VARIANCE

SLIDE 17

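ANALYSIS OF VARIANCE

| Source of variation | SS | df | MS | F-ratio |
|---|---|---|---|---|
| between groups (due to factor “A”) | $SSA$ | $c - 1$ | $MSA = \dfrac{SSA}{c - 1}$ | $F = \dfrac{MSA}{MSW}$ |
| within groups | $SSW$ | $n - c$ | $MSW = \dfrac{SSW}{n - c}$ | |
| total | $SST$ | $n - 1$ | | |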
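The ANOVA table is plain bookkeeping once the two sums of squares are known. The sketch below fills it in from summary numbers; the function name `anova_table` and the input values are hypothetical.

```python
def anova_table(ssa: float, ssw: float, n: int, c: int) -> dict:
    """Build the one-way ANOVA table from the sums of squares.

    ssa: between-groups sum of squares, ssw: within-groups sum of squares,
    n: total number of observations, c: number of groups.
    """
    df_between = c - 1
    df_within = n - c
    msa = ssa / df_between
    msw = ssw / df_within
    return {
        "between": {"SS": ssa, "df": df_between, "MS": msa, "F": msa / msw},
        "within": {"SS": ssw, "df": df_within, "MS": msw},
        "total": {"SS": ssa + ssw, "df": n - 1},
    }


# Hypothetical summary values: 3 groups, 15 observations in total.
table = anova_table(ssa=63.33, ssw=30.0, n=15, c=3)
for source, row in table.items():
    print(source, {key: round(value, 2) for key, value in row.items()})
```

Note that the degrees of freedom add up just like the sums of squares: $(c - 1) + (n - c) = n - 1$.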

SLIDE 18

EXERCISE 2

We sample incomes from 100 persons in the four cities (30 from Atlanta, 20 from Boston, and 25 each from Chicago and Detroit).

a. What are $n$ and $c$ in the previous scheme?
b. Specify $Y_{ij} = \mu_j + \varepsilon_{ij}$ for the case of the 8th respondent from Chicago.

SLIDE 19

TESTING SIGNIFICANCE OF ANOVA

What do we test?
▪ $H_0: \mu_1 = \mu_2 = \dots = \mu_c$
▪ or equivalently $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$

How do we test?
▪ by comparing $SSA$ and $SSW$
▪ or equivalently $MSA$ and $MSW$ (which are variances!)
▪ if $H_0$ is true, $MSA$ and $MSW$ are expected to be equal
▪ their ratio is the test statistic: $F = \dfrac{MSA}{MSW}$
  ▪ if this ratio is large, the group averages are likely to differ

SLIDE 20

TESTING SIGNIFICANCE OF ANOVA

The test statistic $F$
▪ is likely to be around 1 if $H_0$ is true
▪ is likely to be much larger than 1 if $H_1$ is true
▪ has a sampling distribution $F_{c-1,\,n-c}$ under $H_0$

Here $F_{\text{df}_1,\text{df}_2}$ is the $F$-distribution with $\text{df}_1$ and $\text{df}_2$ degrees of freedom

Reject for large values of $F = \dfrac{MSA}{MSW}$ only
▪ because we only reject $H_0$ if variations between groups are larger than expected under $H_0$
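A small simulation can illustrate why the statistic hovers around 1 under the null hypothesis and grows under the alternative. This is only a sketch: the group means, the group size of 10, $\sigma = 1$, the number of runs and the seed are arbitrary choices, and the helper names `f_statistic` and `average_f` are made up.

```python
import random


def f_statistic(groups):
    """F = MSA / MSW for a list of groups (one list of observations per group)."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    grand_mean = sum(y for g in groups for y in g) / n
    means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssa / (c - 1)) / (ssw / (n - c))


def average_f(true_means, n_per_group=10, sigma=1.0, runs=2000, seed=0):
    """Average F over simulated data sets with normal errors and equal variances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        groups = [
            [rng.gauss(mu, sigma) for _ in range(n_per_group)]
            for mu in true_means
        ]
        total += f_statistic(groups)
    return total / runs


avg_f_h0 = average_f([0, 0, 0, 0])  # all population means equal
avg_f_h1 = average_f([0, 0, 0, 2])  # one population mean differs

print(f"average F when H0 is true: {avg_f_h0:.2f}")
print(f"average F when H1 is true: {avg_f_h1:.2f}")
```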

SLIDE 21

TESTING SIGNIFICANCE OF ANOVA

So step 3 becomes:
▪ under $H_0$, $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
▪ under the assumption: $\varepsilon_{ij} \sim N(0, \sigma^2)$

In other words, the assumptions of ANOVA are:
▪ the observations $y_{ij}$ should be independent
▪ the sub-populations should be normally distributed
▪ the sub-populations should have equal variances

Fortunately, ANOVA is somewhat robust to
▪ departures from normality and
▪ departures from the equal variance assumption

SLIDE 22

TESTING SIGNIFICANCE OF ANOVA

Five step procedure for ANOVA
▪ Step 1:
  ▪ $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$; $H_1$: not $H_0$; $\alpha = 0.05$
▪ Step 2:
  ▪ sample statistic: $F = \dfrac{MSA}{MSW}$; reject for large values
▪ Step 3:
  ▪ under $H_0$: $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
  ▪ requirement: normal populations with equal variance
▪ Step 4:
  ▪ calculate $F_{\text{crit}} = F_{\text{upper};\,\text{df}_1,\text{df}_2,\alpha}$
  ▪ or calculate $p\text{-value} = P(F \geq F_{\text{calc}})$
▪ Step 5:
  ▪ reject $H_0$ if $F_{\text{calc}} > F_{\text{crit}}$
  ▪ or reject $H_0$ if $p\text{-value} < \alpha$
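The five steps can be traced in code. The sketch below uses made-up data with three groups, and it takes a shortcut that is not on the slides: when $\text{df}_1 = 2$ (three groups), the upper-tail probability of the $F$-distribution has the closed form $(1 + 2F/\text{df}_2)^{-\text{df}_2/2}$, so no table or statistics library is needed. For any other number of groups one would look up the critical value or $p$-value in a table or in statistical software. The function names are hypothetical.

```python
def upper_tail_f_df1_2(f_value: float, df2: int) -> float:
    """P(F >= f_value) for an F-distribution with df1 = 2 and df2 degrees of freedom.

    Closed form valid ONLY for df1 = 2.
    """
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


def one_way_anova_three_groups(groups, alpha=0.05):
    """Steps 2-5 for exactly three groups; returns (F_calc, p_value, reject_H0)."""
    if len(groups) != 3:
        raise ValueError("this sketch only handles three groups (df1 = 2)")
    n = sum(len(g) for g in groups)
    c = len(groups)
    grand_mean = sum(y for g in groups for y in g) / n
    means = [sum(g) / len(g) for g in groups]

    # Step 2: sample statistic F = MSA / MSW.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df1, df2 = c - 1, n - c
    f_calc = (ssa / df1) / (ssw / df2)

    # Steps 3-4: under H0, F ~ F(df1, df2); compute the p-value.
    p_value = upper_tail_f_df1_2(f_calc, df2)

    # Step 5: reject H0 if the p-value is below the significance level.
    return f_calc, p_value, p_value < alpha


# Step 1: H0: all group effects are 0; H1: not H0; significance level 0.05.
# Hypothetical data: three groups of five observations.
data = [
    [10, 12, 11, 13, 14],
    [15, 14, 16, 17, 13],
    [11, 9, 10, 12, 8],
]
f_calc, p_value, reject = one_way_anova_three_groups(data, alpha=0.05)
print(f"F_calc = {f_calc:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```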

SLIDE 23

TESTING SIGNIFICANCE OF ANOVA

Rejecting $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$
▪ is equivalent to rejecting $H_0: \mu_1 = \mu_2 = \dots = \mu_c$
▪ means that at least one of the $\alpha$s is not 0
▪ means that at least one of the $\mu$s differs from another $\mu$
▪ means that there is a “factor effect” or “treatment effect”
▪ does not (yet) tell us which of the groups is or are significantly different
▪ does not (yet) tell us if the differing groups are significantly lower or higher
  ▪ we need a “post-hoc” test to find out which groups are different and in which direction

SLIDE 24

PERFORMING ANOVA

Example
▪ Context:
  ▪ you want to see if three different golf clubs yield different distances
  ▪ you randomly select five measurements from trials on an automated driving machine for each club
  ▪ at the 0.05 significance level, is there a difference in mean distance?

SLIDE 25

PERFORMING ANOVA

The three means are different
But are they statistically (significantly) different?

Here:
▪ $\bar{\bar{Y}}_{\cdot\cdot} = 232.1$
▪ $\bar{Y}_{\cdot 1} = 249.2$
▪ $\bar{Y}_{\cdot 2} = 226.0$
▪ $\bar{Y}_{\cdot 3} = 221.0$
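The between-groups part of the calculation can already be done from these reported means, using the five measurements per club mentioned in the example. The sketch below does only that; the within-groups sum of squares needs the individual measurements, which are not reproduced here, so the $F$-ratio itself is left open. Variable names are made up.

```python
# Group means as reported on the slide (club number -> mean distance).
group_means = {1: 249.2, 2: 226.0, 3: 221.0}
n_per_club = 5                     # five measurements per club
c = len(group_means)               # number of groups
n = c * n_per_club                 # total number of observations

# Grand mean: with equal group sizes this is the average of the group means.
grand_mean = sum(n_per_club * m for m in group_means.values()) / n

# Between-groups sum of squares and mean square.
ssa = sum(n_per_club * (m - grand_mean) ** 2 for m in group_means.values())
msa = ssa / (c - 1)

print(f"grand mean = {grand_mean:.1f}")
print(f"SSA = {ssa:.2f} with df = {c - 1}, so MSA = {msa:.2f}")
```

The grand mean computed this way rounds to the 232.1 shown on the slide.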

SLIDE 26

PERFORMING ANOVA

Organizing the data in SPSS
▪ independent samples, so not [image not reproduced]
▪ but rather [image not reproduced]
▪ and then [image not reproduced]

SLIDE 27

PERFORMING ANOVA

[ANOVA output table, annotated with: $SSA$; $SSW$; $SST = SSA + SSW$; $MS = \dfrac{SS}{\text{df}}$; $F = \dfrac{MSA}{MSW}$; $p$-value (one-tailed)]

SLIDE 28

THE STATISTICAL MODEL

In the 5-step procedure the underlying model is not mentioned
But it can be useful to mention it
▪ “Step 0”: $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$ with $\varepsilon_{ij} \sim N(0, \sigma^2)$

In fact, we could also do that in our previous tests:
▪ one-sample $\mu$: $X_i = \mu + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
▪ two-sample $\mu$: $Y_{i1} = \mu_1 + \varepsilon_i$ and $Y_{i2} = \mu_2 + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
▪ etc.

In ANOVA and regression analysis, the statistical model must be stated as a “step 0”
▪ in other cases, you may leave it out

SLIDE 29

OLD EXAM QUESTION

23 March 2015, Q1i-k

SLIDE 30