Business Statistics: Several $\mu$s: Comparison



slide-1
SLIDE 1

SEVERAL $\mu$S: COMPARISON

Business Statistics

slide-2
SLIDE 2

▪ Comparing two $\mu$s
▪ Comparing more than two $\mu$s
▪ Analysis of variance
▪ Testing significance of ANOVA
▪ Performing ANOVA
▪ The statistical model
▪ Equal variances
▪ Old exam question
▪ Further study

CONTENTS

slide-3
SLIDE 3

Recall the comparison of the means of two independent samples ($Y_1$ and $Y_2$) with the $t$-test:

▪ $H_0: \mu_1 = \mu_2$
▪ under $H_0$: $\dfrac{\bar{Y}_1 - \bar{Y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} \sim t_{\text{df}}$, where df is the number of degrees of freedom, depending on the assumption of the variances
▪ reject when $t_{\text{calc}} = \dfrac{\bar{y}_1 - \bar{y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} > t_{\text{crit}}$

Can we do this for three samples as well?
▪ $H_0: \mu_1 = \mu_2 = \mu_3$

COMPARING TWO $\mu$S

We earlier wrote $X_1$ and $X_2$ or $X$ and $Y$, so why not $Y_1$ and $Y_2$?

slide-4
SLIDE 4

A first attempt:
▪ think about how $H_0: \mu_1 = \mu_2$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2}{\sigma_{\text{diff}}} \sim t_{\text{df}}$
▪ try if $H_0: \mu_1 = \mu_2 = \mu_3$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2 - \bar{Y}_3}{\sigma_{\text{diff}}} \sim t_{\text{df}}$

Very wrong!
▪ every year a few students try to do this at their exam, but it is a dead end!

COMPARING MORE THAN TWO $\mu$S

slide-5
SLIDE 5

A second attempt:
▪ pairwise comparisons: $Y_1$ vs. $Y_2$ and $Y_2$ vs. $Y_3$
  ▪ $Y_1$ vs. $Y_3$ not needed
▪ that implies doing two tests
  ▪ not one

For each test the probability of incorrectly rejecting $H_0$ is $\alpha$
▪ so with 2 (independent) tests, the probability of at least one incorrect rejection becomes $1 - (1 - \alpha)^2$
▪ example: $\alpha = 0.05$ gives 0.0975, so almost double

So that will not work
▪ think about testing 10 means (45 comparisons): the probability of a wrong decision becomes about 90%

COMPARING MORE THAN TWO $\mu$S
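The error-rate arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the individual tests are independent; the function names `familywise_error` and `n_pairwise` are illustrative and not part of the course material.

```python
from math import comb


def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one false rejection in n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


def n_pairwise(n_means: int) -> int:
    """Number of distinct pairs among n_means group means."""
    return comb(n_means, 2)


two_tests = familywise_error(0.05, 2)               # the two-test example
ten_means = familywise_error(0.05, n_pairwise(10))  # all pairs among 10 means
print(round(two_tests, 4), n_pairwise(10), round(ten_means, 2))
```

The error rate grows quickly with the number of comparisons, which is the motivation for replacing many pairwise tests by a single overall test.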

slide-6
SLIDE 6

A third attempt:
▪ We can partition the sum of squares into two parts:
  ▪ between the three groups
  ▪ within each group
▪ Try to identify sources of variation in a numerical dependent variable $Y$ (the response variable)
▪ Variation in $Y$ about its mean is partly explained by a categorical independent variable (the factor, with different levels)
  ▪ and partly unexplained (random error)

COMPARING MORE THAN TWO $\mu$S

slide-7
SLIDE 7

Example
▪ Chip defect rates are different in every batch, but are there systematic differences between manufacturers?
  ▪ numerical dependent variable: chip defect rate ($Y$)
  ▪ categorical independent variable (one factor with four levels): manufacturer (1-4)

COMPARING MORE THAN TWO $\mu$S

slide-8
SLIDE 8

Statistical model (formulation 1):
▪ manufacturer 1: $Y_{i1} = \mu_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu_4 + \varepsilon_{i4}$

Where:
▪ $Y$ is the defect rate
▪ $\mu_1$ is the mean for manufacturer 1
▪ ...
▪ $\mu_4$ is the mean for manufacturer 4
▪ $\varepsilon$ is the random, unexplained, part

Null hypothesis:
▪ $\mu_1 = \mu_2 = \mu_3 = \mu_4$

COMPARING MORE THAN TWO $\mu$S

slide-9
SLIDE 9

Statistical model (formulation 2):
▪ manufacturer 1: $Y_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu + \alpha_4 + \varepsilon_{i4}$

Where
▪ $Y$ is the defect rate
▪ $\mu$ is the overall mean defect rate
▪ $\alpha_1$ is manufacturer 1’s mean deviation from $\mu$
▪ ...
▪ $\alpha_4$ is manufacturer 4’s mean deviation from $\mu$
▪ $\varepsilon$ is the random, unexplained, part

Null hypothesis:
▪ $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
  ▪ that is, $\mu + \alpha_1 = \dots = \mu + \alpha_4$

COMPARING MORE THAN TWO $\mu$S

This “group effect” $\alpha$ has nothing to do with the significance level $\alpha$!
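The two formulations describe the same model. A short sketch of the link, under the assumption (not spelled out on the slides) that $\mu$ is the overall mean so that the group effects average out to zero:

```latex
\begin{aligned}
\mu_j &= \mu + \alpha_j
  \quad\Longleftrightarrow\quad
  \alpha_j = \mu_j - \mu, \qquad j = 1, \dots, 4, \\
\mu_1 = \mu_2 = \mu_3 = \mu_4
  &\quad\Longleftrightarrow\quad
  \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0 .
\end{aligned}
```

If all group means are equal, they all coincide with the overall mean, so every deviation from it is zero; conversely, zero deviations give equal group means.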

slide-10
SLIDE 10

▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
▪ Formulation 1:
  ▪ wrong: $H_1: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
  ▪ correct: $H_1$: not ($\mu_1 = \mu_2 = \mu_3 = \mu_4$)
  ▪ or: at least one of the $\mu$s differs from the other $\mu$s

COMPARING MORE THAN TWO $\mu$S

Both $\mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$ and $\mu_1 = \mu_2 = \mu_3 \neq \mu_4$ contradict $H_0$.

slide-11
SLIDE 11

▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
▪ Formulation 2:
  ▪ wrong: $H_1: \alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4 \neq 0$
  ▪ correct: $H_1$: not ($\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$)
  ▪ or: at least one of the $\alpha$s differs from 0

COMPARING MORE THAN TWO $\mu$S

Both $\alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4$ and $\alpha_1 = \alpha_2 = \alpha_3 \neq \alpha_4$ contradict $H_0$.

slide-12
SLIDE 12

We want to investigate possible differences in mean income in Atlanta, Boston, Chicago and Detroit.

  • a. What is the null hypothesis?
  • b. Suppose the null hypothesis is rejected. What can you conclude?

EXERCISE 1

slide-13
SLIDE 13

▪ Define notation:
  ▪ $y$ is the numerical value (e.g., chip defect rate)
  ▪ $y_{ij}$ is the value for observation #$i$ within treatment #$j$ (e.g., machine #$j$)
  ▪ $\bar{y}_{\cdot j}$ is the average over all observations ($i = 1, \dots, n_j$) within treatment #$j$
  ▪ $\bar{\bar{y}}_{\cdot\cdot}$ is the average over all observations within all treatments ($j = 1, \dots, c$)
▪ Analysis of variance (ANOVA) model: $Y_{ij} = \mu_j + \varepsilon_{ij}$ or $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$

ANALYSIS OF VARIANCE

Observe the position of the dots. A dot indicates the index that has been averaged over.
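To make the dot notation concrete, here is a small sketch that computes the group averages and the grand average. The data values are hypothetical (they do not come from the slides), and the group sizes are deliberately unequal.

```python
from statistics import mean

# Hypothetical defect rates for three treatments of unequal size (not from the slides).
groups = {
    1: [2.0, 4.0],
    2: [6.0, 8.0, 10.0, 12.0],
    3: [5.0, 7.0, 9.0],
}

# Group average: index i averaged out within each treatment j.
group_means = {j: mean(ys) for j, ys in groups.items()}

# Grand average: both indices averaged out, over all n observations.
all_values = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_values)

# With unequal group sizes the grand mean equals the size-weighted mean of the
# group means, not the plain average of the group means.
n = len(all_values)
weighted = sum(len(groups[j]) * group_means[j] for j in groups) / n
print(group_means, grand_mean, weighted)
```

The weighting by group size is the same weighting that appears in the between-groups sum of squares.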

slide-14
SLIDE 14

Compare variation within groups to variation between groups
▪ variation within group #$j$:
  ▪ $SSW_j = \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\cdot j} \right)^2$
▪ variation within all groups $j = 1, \dots, c$:
  ▪ $SSW = \sum_{j=1}^{c} SSW_j = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\cdot j} \right)^2$
▪ variation between the $c$ groups
  ▪ so due to the $\alpha$s:
  ▪ $SSA = \sum_{j=1}^{c} n_j \left( \bar{y}_{\cdot j} - \bar{\bar{y}}_{\cdot\cdot} \right)^2$

ANALYSIS OF VARIANCE

So $SSA$ is the variation around the mean $\bar{\bar{y}}_{\cdot\cdot}$ that is explained by the model, by factor “A”

slide-15
SLIDE 15

Together, $SSA$ and $SSW$ make up the total variation
▪ variation in entire data set:
  ▪ $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{\bar{y}}_{\cdot\cdot} \right)^2$
▪ so
  ▪ $SST = SSA + SSW$
▪ Think about the logic: we are comparing several means by comparing two variances
  ▪ analysis of variance is used to compare $\mu_1, \mu_2, \dots, \mu_c$

ANALYSIS OF VARIANCE

So $SST$ is the total variation around the grand mean $\bar{\bar{y}}_{\cdot\cdot}$

[Figure labels: $SST$, $SSA$, $SSE$]
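The partition of the total variation into a between-groups and a within-groups part can be verified numerically. The sketch below uses hypothetical data (not from the slides); the variable names `ssa`, `ssw` and `sst` are chosen here to mirror the three sums of squares.

```python
from statistics import mean

# Hypothetical data: c = 3 groups with unequal sizes (not from the slides).
groups = [[2.0, 4.0], [6.0, 8.0, 10.0, 12.0], [5.0, 7.0, 9.0]]

group_means = [mean(ys) for ys in groups]
grand = mean([y for ys in groups for y in ys])

# Within groups: squared deviations of each observation from its own group mean.
ssw = sum((y - m) ** 2 for ys, m in zip(groups, group_means) for y in ys)

# Between groups: squared deviations of group means from the grand mean,
# weighted by group size.
ssa = sum(len(ys) * (m - grand) ** 2 for ys, m in zip(groups, group_means))

# Total: squared deviations of each observation from the grand mean.
sst = sum((y - grand) ** 2 for ys in groups for y in ys)

print(ssa, ssw, sst)
```

For any data set the total equals the sum of the two parts, up to floating-point rounding.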

slide-16
SLIDE 16

ANALYSIS OF VARIANCE

slide-17
SLIDE 17

| Source of variation | SS | df | MS | F-ratio |
|---|---|---|---|---|
| between groups (due to factor “A”) | $SSA$ | $c - 1$ | $MSA = \dfrac{SSA}{c - 1}$ | $F = \dfrac{MSA}{MSW}$ |
| within groups | $SSW$ | $n - c$ | $MSW = \dfrac{SSW}{n - c}$ | |
| total | $SST$ | $n - 1$ | | |

ANALYSIS OF VARIANCE
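The ANOVA table is a fixed recipe once the two sums of squares, the number of observations and the number of groups are known. A minimal sketch follows; the helper name `anova_table` and the input numbers are hypothetical.

```python
def anova_table(ssa: float, ssw: float, n: int, c: int) -> dict:
    """One-way ANOVA table entries from the sums of squares, n observations, c groups."""
    df_a, df_w = c - 1, n - c
    msa, msw = ssa / df_a, ssw / df_w
    return {
        "between": {"SS": ssa, "df": df_a, "MS": msa, "F": msa / msw},
        "within": {"SS": ssw, "df": df_w, "MS": msw},
        "total": {"SS": ssa + ssw, "df": n - 1},
    }


# Hypothetical inputs (not from the slides): SSA = 48, SSW = 30, n = 9, c = 3.
table = anova_table(48.0, 30.0, n=9, c=3)
for source, row in table.items():
    print(source, row)
```

Note that the degrees of freedom add up in the same way as the sums of squares: (c - 1) + (n - c) = n - 1.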

slide-18
SLIDE 18

We sample incomes of 100 persons from the four cities (30 from Atlanta, 20 from Boston, 25 each from Chicago and Detroit).

  • a. What are $n$ and $c$ in the previous scheme?
  • b. Specify $Y_{ij} = \mu_j + \varepsilon_{ij}$ for the case of the 8th respondent from Chicago.

EXERCISE 2

slide-19
SLIDE 19

What do we test?
▪ $H_0: \mu_1 = \mu_2 = \dots = \mu_c$
▪ or equivalently $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$

How do we test?
▪ by comparing $SSA$ and $SSW$
▪ or equivalently $MSA$ and $MSW$ (which are variances!)
▪ if $H_0$ is true, $MSA$ and $MSW$ are expected to be equal
▪ their ratio is the test statistic: $F = \dfrac{MSA}{MSW}$
  ▪ if this ratio is large, the group averages are likely to differ

TESTING SIGNIFICANCE OF ANOVA

slide-20
SLIDE 20

The test statistic $F$
▪ is likely to be around 1 if $H_0$ is true
▪ is likely to be much larger than 1 if $H_1$ is true
▪ has a sampling distribution $F_{c-1,\,n-c}$ under $H_0$

Here $F_{\text{df}_1,\text{df}_2}$ is the $F$-distribution with $\text{df}_1$ and $\text{df}_2$ degrees of freedom

Reject for large values of $F = \dfrac{MSA}{MSW}$ only
▪ because we only reject $H_0$ if variations between groups are larger than expected under $H_0$

TESTING SIGNIFICANCE OF ANOVA
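The behaviour of the test statistic under the null and under an alternative can be illustrated by simulation. This is only a sketch: the group means, group size, error standard deviation, number of repetitions and seed are all arbitrary assumptions, and the helper names `f_statistic` and `simulate_f` are made up for this example.

```python
import random


def f_statistic(groups):
    """Ratio MSA / MSW for a list of groups (each a list of observations)."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n
    ssa = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssa / (c - 1)) / (ssw / (n - c))


def simulate_f(group_means, n_per_group=5, sigma=1.0, reps=2000, seed=1):
    """Draw `reps` data sets from the one-way model with normal errors and return the F values."""
    rng = random.Random(seed)
    return [
        f_statistic([[rng.gauss(mu, sigma) for _ in range(n_per_group)]
                     for mu in group_means])
        for _ in range(reps)
    ]


f_null = simulate_f([0.0, 0.0, 0.0])  # null hypothesis true: all group means equal
f_alt = simulate_f([0.0, 0.0, 2.0])   # alternative true: one group mean differs
print(sum(f_null) / len(f_null), sum(f_alt) / len(f_alt))
```

In runs like this the simulated statistic stays in the neighbourhood of 1 when all means are equal and is typically several times larger when one mean differs, which is why only large values count as evidence against the null hypothesis.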

slide-21
SLIDE 21

So step 3 becomes:
▪ under $H_0$, $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
▪ under the assumption: $\varepsilon_{ij} \sim N(0, \sigma^2)$

In other words, the assumptions of ANOVA are:
▪ the observations $y_{ij}$ should be independent
▪ the sub-populations should be normally distributed
▪ the sub-populations should have equal variances

Fortunately, ANOVA is somewhat robust to
▪ departures from normality and
▪ departures from the equal variance assumption

TESTING SIGNIFICANCE OF ANOVA

slide-22
SLIDE 22

Five step procedure for ANOVA
▪ Step 1:
  ▪ $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$; $H_1$: not $H_0$; $\alpha = 0.05$
▪ Step 2:
  ▪ sample statistic: $F = \dfrac{MSA}{MSW}$; reject for large values
▪ Step 3:
  ▪ under $H_0$: $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
  ▪ requirement: normal populations with equal variance
▪ Step 4:
  ▪ calculate $F_{\text{crit}} = F_{\text{upper};\,\text{df}_1,\text{df}_2,\alpha}$
  ▪ or calculate $p\text{-value} = P(F \geq F_{\text{calc}})$
▪ Step 5:
  ▪ reject $H_0$ if $F_{\text{calc}} > F_{\text{crit}}$
  ▪ or reject $H_0$ if $p\text{-value} < \alpha$

TESTING SIGNIFICANCE OF ANOVA
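The five steps can be walked through in code. The sketch below is not the course's method for step 4: instead of looking up the F distribution in a table or in SPSS, it approximates the p-value by Monte Carlo simulation under the null hypothesis, using only the Python standard library. The data, the helper names and the simulation settings are hypothetical.

```python
import random


def f_statistic(groups):
    """Ratio MSA / MSW for a list of groups (each a list of observations)."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n
    ssa = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssa / (c - 1)) / (ssw / (n - c))


def monte_carlo_p_value(f_calc, sizes, reps=5000, seed=1):
    """Approximate P(F >= f_calc) under the null by simulating standard normal data.

    The F ratio does not depend on the common mean or variance, so N(0, 1) suffices.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sim = [[rng.gauss(0.0, 1.0) for _ in range(n_j)] for n_j in sizes]
        if f_statistic(sim) >= f_calc:
            hits += 1
    return hits / reps


# Hypothetical data for three groups (not from the slides).
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

alpha = 0.05                                                      # step 1: hypotheses and level
f_calc = f_statistic(groups)                                      # step 2: sample statistic
p_value = monte_carlo_p_value(f_calc, [len(g) for g in groups])   # steps 3-4: null distribution
reject = p_value < alpha                                          # step 5: decision
print(f_calc, p_value, reject)
```

With real data, statistical software would give the exact p-value from the F distribution; the simulation is only a stand-in that keeps the example self-contained.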

slide-23
SLIDE 23

Rejecting $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_c = 0$
▪ is equivalent to rejecting $H_0: \mu_1 = \mu_2 = \dots = \mu_c$
▪ means that at least one of the $\alpha$s is not 0
▪ means that at least one of the $\mu$s differs from another $\mu$
▪ means that there is a “factor effect” or “treatment effect”
▪ does not (yet) tell us which of the groups is or are significantly different
▪ does not (yet) tell us whether the differing groups are significantly lower or higher
  ▪ we need a “post-hoc” test to find out which groups are different and in which direction

TESTING SIGNIFICANCE OF ANOVA

slide-24
SLIDE 24

Example
▪ Context:
  ▪ you want to see if three different golf clubs yield different distances
  ▪ you randomly select five measurements from trials on an automated driving machine for each club
  ▪ at the 0.05 significance level, is there a difference in mean distance?

PERFORMING ANOVA

slide-25
SLIDE 25

The three means are different
But are they statistically (significantly) different?
Here:
▪ $\bar{\bar{Y}}_{\cdot\cdot} = 232.1$
▪ $\bar{Y}_{\cdot 1} = 249.2$
▪ $\bar{Y}_{\cdot 2} = 226.0$
▪ $\bar{Y}_{\cdot 3} = 221.0$

PERFORMING ANOVA
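From the reported group means and the five measurements per club, the grand mean and the between-groups part of the ANOVA table can be recomputed. This is a sketch only: the individual distances are not reproduced here, so the within-groups sum of squares and the F ratio cannot be obtained this way.

```python
# Group means as reported above; five measurements per club (from the example context).
club_means = {1: 249.2, 2: 226.0, 3: 221.0}
n_j = 5
c = len(club_means)

# With equal group sizes the grand mean is the plain average of the group means.
grand_mean = sum(club_means.values()) / c

# Between-groups sum of squares and mean square.
ssa = sum(n_j * (m - grand_mean) ** 2 for m in club_means.values())
msa = ssa / (c - 1)

print(round(grand_mean, 1), round(ssa, 1), round(msa, 1))
```

The rounded grand mean agrees with the value 232.1 reported for the example.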

slide-26
SLIDE 26

Organizing the data in SPSS
▪ independent samples, so not [screenshot not preserved]
▪ but rather [screenshot not preserved]
▪ and then [screenshot not preserved]

PERFORMING ANOVA

slide-27
SLIDE 27

PERFORMING ANOVA

[SPSS ANOVA output, annotated with: $SSA$; $SSW$; $SST = SSA + SSW$; $MS = \dfrac{SS}{\text{df}}$; $F = \dfrac{MSA}{MSW}$; $p$-value (one-tailed)]

slide-28
SLIDE 28

In the 5-step procedure the underlying model is not mentioned
But it can be useful to mention it
▪ “Step 0”: $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$ with $\varepsilon_{ij} \sim N(0, \sigma^2)$

In fact, we could also do that in our previous tests:
▪ one-sample $\mu$: $X_i = \mu + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
▪ two-sample $\mu$: $Y_{i1} = \mu_1 + \varepsilon_i$ and $Y_{i2} = \mu_2 + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
▪ etc.

In ANOVA and regression analysis, the statistical model must be stated as a “step 0”
▪ in other cases, you may leave it out

THE STATISTICAL MODEL

slide-29
SLIDE 29

23 March 2015, Q1i-k

OLD EXAM QUESTION

slide-30
SLIDE 30

Doane & Seward 5/E 11.1-11.2
Tutorial exercises week 4
full ANOVA, idea of ANOVA

FURTHER STUDY