Lecture 16. Linear model with a classifying factor 2020 (1) A - - PowerPoint PPT Presentation

lecture 16 linear model with a classifying factor 2020 1
SMART_READER_LITE
LIVE PREVIEW

Lecture 16. Linear model with a classifying factor 2020 (1) A - - PowerPoint PPT Presentation

Lecture 16. Linear model with a classifying factor 2020 (1) A simple anova (2) Two breeds of sheep Black Welsh A measurement is taken on a 6.6 10.4 random sample of ve animals 8.1 9.8 from each of two breeds. Is 7.6 11.0 there


slide-1
SLIDE 1

Lecture

  • 16. Linear model with a classifying factor

2020

slide-2
SLIDE 2

(1) A simple anova

slide-3
SLIDE 3

(2) Two breeds of sheep

Black Welsh 6.6 10.4 8.1 9.8 7.6 11.0 6.9 10.6 8.3 9.2 A measurement is taken on a random sample of ˛ve animals from each of two breeds. Is there evidence for a di¸erence between breeds?

slide-4
SLIDE 4

(3) An indicator variable

Breed X Y Black 6.6 Black 8.1 Black 7.6 Black 6.9 Black 8.3 Welsh 1 10.4 Welsh 1 9.8 Welsh 1 11.0 Welsh 1 10.6 Welsh 1 9.2 X is an indicator variable for the Welsh breed.

slide-5
SLIDE 5

(4) An indicator variable

Set X = 0 for each measurement on a Blackface animal and X = 1 for each measurement on a Welsh

  • animal. The linear model E(Y ) = b0 + b1X assigns

means to breeds as follows: Breed X b0 + b1X Black b0 Welsh 1 b0 + b1 b0 : mean value for Blackface breed b1 : di¸erence between Welsh and Blackface Null hypothesis b1 = 0 is equivalent to ’no di¸erences between breeds’.

slide-6
SLIDE 6

(5) Using lm (with indicator variable)

Y <- 0.1 * c(66,81,76,69,83,104,98,110,106,92) X <- rep(0:1, each = 5) fit <- lm(Y ˜ X) summary(fit) anova(fit)

t statistic (from summary) or F statistic (from anova) tests b1 = 0 (no di¸erence between breeds).

slide-7
SLIDE 7

(6) Using lm (with factor)

Y <- 0.1 * c(66,81,76,69,83,104,98,110,106,92) Breed <- gl(2,5, labels = c(’Black’,’Welsh’)) fit <- lm(Y ˜ Breed) summary(fit)

t statistic (from summary) or F statistic (from anova) tests b1 = 0 (no di¸erence between breeds). Output is identical to that obtained using an indicator variable (apart from labelling of estimates).

slide-8
SLIDE 8

(7) Output from summary and anova

Output from summary: Estimate SE t Intercept 7.5 0.3233 23.201 BreedWelsh 2.7 0.4572 5.906 Output from anova: DF SSQ MSQ F Breed 1 18.225 18.2250 34.88 Residuals 8 4.180 0.5225

slide-9
SLIDE 9

(8) Three breeds of sheep

Three breeds of sheep: Blackface, Welsh, and the Blackface ˆ Welsh cross. A measurement is taken on a random sample of ˛ve animals from each breed. Is there evidence for a di¸erence between breeds? Black Welsh Cross 6.6 10.4 7.0 8.1 9.8 9.3 7.6 11.0 8.4 6.9 10.6 7.6 8.3 9.2 9.7

slide-10
SLIDE 10

(9) Two indicator variables

X and Z are indicators for the Welsh and Cross

  • breeds. The linear model E(Y ) = b0 + b1X + b2Z

assigns means to breeds as follows: Breed X Z b0 + b1X + b2Z Black b0 Welsh 1 b0 + b1 Cross 1 b0 + b2 b0 : mean value for Blackface breed b1 : di¸erence between Welsh and Blackface b2 : di¸erence between Cross and Blackface Null hypothesis b1 = b2 = 0 is equivalent to ’no di¸erences among breeds’.

slide-11
SLIDE 11

(10) Parameter estimates

Parameter estimates: ^ b0 = — YB; ^ bW = — YW ` — YB; ^ bC = — YC ` — YB Fitted value ^ b0 + ^ bW X + ^ bCZ is — YB = ^ b0 (Blackface observation) — YW = ^ b0 + ^ bW (Welsh observation) — YC = ^ b0 + ^ bC (Cross observation) Fitted value for an observation is the mean

  • bservation for its breed.
slide-12
SLIDE 12

(11) Sums of squares

Anova equation (for multiple regression):

X

(Y ` — Y )2 =

X

(^ Y ` — Y )2 +

X

(Y ` ^ Y )2 Regression sum of squares (SB) is the mean-corrected sum of squares of the three ˛tted values, weighted by the size of sample. Residual sum of squares (SW ) is the sum of the mean-corrected sums of squares within each breed.

slide-13
SLIDE 13

(12) Anova table

With k groups, and a total of N = nk observations, regression and residual sums of squares have k`1 and N`k d.f. Anova table is calculated in the usual way: Source DF SSQ MSQ F Between k ` 1 SB MB MB=MW Within N ` k SW MW MW (previously S2) estimates the residual variance.

slide-14
SLIDE 14

(13) The ’sheep’ data frame

Breed Cu X Z Black 6.6 Black 8.1 Black 7.6 Black 6.9 Black 8.3 Welsh 10.4 1 Welsh 9.8 1 Welsh 11.0 1 Welsh 10.6 1 Welsh 9.2 1 Cross 7.0 1 Cross 9.3 1 Cross 8.4 1 Cross 7.6 1 Cross 9.7 1

slide-15
SLIDE 15

(14) Using lm (with factor)

library(sda) # if necessary fit <- lm(Cu ˜ Breed, data = sheep) anova(fit)

Alternatively:

fit <- aov(Cu ˜ Breed) summary(fit)

slide-16
SLIDE 16

(15) ANOVA

Anova table for the sheep data is Source DF SSQ MSQ F Between breeds 2 18.90 9.450 12.22 Within breeds 12 9.28 0.773 Total 14 28.18 F = 12:22 on 2 and 12 d.f. (P < 0:01). Di¸erences between breeds are established beyond reasonable doubt.

slide-17
SLIDE 17

END OF LECTURE

slide-18
SLIDE 18

Lecture

  • 17. Comparisons among means

2020

slide-19
SLIDE 19

(16) ANOVA

Anova table for the sheep data is Source DF SSQ MSQ F Between breeds 2 18.90 9.450 12.22 Within breeds 12 9.28 0.773 Total 14 28.18 F = 12:22 on 2 and 12 d.f. (P < 0:01). Di¸erences between breeds are established beyond reasonable doubt.

slide-20
SLIDE 20

(17) Comparing two means

Mean values for each breed: Black Welsh Cross 7.5 10.2 8.4 Estimated standard error of a di¸erence between two means based on n1 and n2 observations is

q

MW (1=n1 + 1=n2) where MW is the within-group mean square. For the sheep data, MW = 0:773, n1 = n2 = 5, and standard error of di¸erence between any two breed means is 0.556.

slide-21
SLIDE 21

(18) Output from summary

Estimate Std.Error t.value (Intercept) 7.5000 0.3933 19.071 BreedCross 0.9000 0.5562 1.618 BreedWelsh 2.7000 0.5562 4.855

slide-22
SLIDE 22

(19) Comparing two means

Comparison Estimate SE t Welsh ` Blackface 2.7 0.556 4.86 Cross ` Blackface 0.9 0.556 1.62 Welsh ` Cross 1.8 0.556 3.24 (Upper 2.5% point of t on 12 d.f. is 2.179). A 95% con˛dence interval for Welsh ` Blackface is 2.7 ˚ 2.179 ˆ 0.556, or (1.49, 3.91).

slide-23
SLIDE 23

(20) More general comparisons

The ’contrast’ C = a1 — Y1 + a2 — Y2 + ´ ´ ´ + ak — Yk has estimated standard error

v u u u tMW @ a2 1

n1 + a2

2

n2 + ´ ´ ´ + a2

k

nk

1 A

Under H0: E(C) = 0, the statistic obtained by dividing C by its estimated standard error has an t distn with N ` k degrees of freedom. Usually a1 + a2 + ´ ´ ´ + ak = 0.

slide-24
SLIDE 24

(21) Example of a contrast

A possible contrast of interest is 0:5 ˜ (— YB + — YW ) ` — YC. Value of this contrast is 0:5 ˆ (7:5 + 10:2) ` 8:4 = 0:45, with estimated standard error

q

0:773(1=20 + 1=20 + 1=5) = 0:48155. T statistic is 0:45=0:48155 = 0:93 with 12 d.f. Test result is not signi˛cant. A 95% interval estimate for the contrast is 0:45 ˚ 2:179 ˆ 0:48155, or (`0:60; +1:50).

slide-25
SLIDE 25

(22) Fruit ‚ies

Bristle counts on 20 fruit ‚ies, ˛ve ‚ies in each of four genotype classes. A B C D 16 8 9 5 12 12 12 8 16 7 11 8 11 9 10 7 15 12 8 9 genotype class means A B C D (14.0) (9.6) (10.0) (7.4)

slide-26
SLIDE 26

(23) Fruit ‚ies

ANOVA of the fruit ‚y data shows highly signi˛cant di¸erences among genotype classes (P < 0.001). We could continue the analysis by producing a table

  • f six pairwise comparisons, A versus B, etc, but this

would not shed much light on the data. Additional information on the genotype classes allows a more informative analysis.

slide-27
SLIDE 27

(24) Cy and Me mutations

The key to a more informative analysis is knowing that genotype classes A { D are determined by presence (+) or absence (`) of two mutations, Cy and Me. A B C D Cy ` ` + + Me ` + ` +

slide-28
SLIDE 28

(25) Informative contrasts

A B C D Cy ` ` + + Me ` + ` + CyˆMe + ` ` + (C`A) estimates the Cy e¸ect when Me is absent, (D`B) estimates the Cy e¸ect when Me is present. The Cy contrast estimates the sum (or average) of these two conditional e¸ects. (C`A+B`D) estimates the di¸erence between the two conditional e¸ects (the interaction CyˆMe).

slide-29
SLIDE 29

(26) An interaction plot

mutation Cy number of bristles absent present 8 10 12 14 mutation Me absent present

slide-30
SLIDE 30

(27) Anova with two factors

Instead of setting up one factor (genotype) with four levels, set up two factors (Cy and Me), each with two levels (’present’, ’absent’). Anova based on bristles ‰ Cy + Me + Cy:Me has three single d.f. sums of squares, for the average Cy e¸ect, the average Me e¸ect, and the interaction. F ratios are the squares of t statistics obtained from contrasts, and the three sums of squares add up to the ’genotypes’ sum of squares with 3 d.f. Model formula can also be written bristles ‰ Cy * Me

slide-31
SLIDE 31

END OF LECTURE

slide-32
SLIDE 32

Lecture

  • 18. The random e¸ects model

2020

slide-33
SLIDE 33

(28) Random e¸ects model

As previously we have k groups of size n, total number of observations N = nk. Random e¸ects model: Yi = m + Ur + ei where r is the group which contains observation i. U1 : : : Uk and e1 : : : eN are independent r.v.s normally distributed with zero mean and var(U) = ff2

B;

var(e) = ff2

W :

Total variance of a single observation ff2

B + ff2 W is

partitioned into components ff2

B and ff2 W .

slide-34
SLIDE 34

(29) Expected mean squares

Source DF MSQ E(MSQ) Between groups k ` 1 MB ff2

W + nff2 B

Within groups N ` k MW ff2

W

F = MB=MW tests H0 : ff2

B = 0.

— Y estimates m, with variance (ff2

W + nff2 B)=N.

Estimated standard error is E =

q

MB=N. Interval estimate for m is — Y ˚ kE where k is an upper quantile of t with k ` 1 d.f.

slide-35
SLIDE 35

(30) Inference about the mean m

Why does the standard error of — Y depend on MB rather than MW ? Why does the t quantile for the calculation of the interval estimate have k ` 1 rather than N ` k d.f? Suppose we calculate means — Y1 : : : — Yk for each group. Each mean has variance (ff2

W + nff2 B)=n. To estimate

m, we use the average of the group means. The variance of this estimate is (ff2

W + nff2 B)=(nk).

The estimate of ff2

W + nff2 B is MB, which is

proportional to the corrected sum of squares of the k group means, and therefore has k ` 1 d.f.

slide-36
SLIDE 36

(31) Estimating variance components

DF MSQ E(MSQ) Between groups k ` 1 MB ff2

W + nff2 B

Within groups N ` k MW ff2

W

Estimates of ff2

W and ff2 B are obtained by equating

  • bserved and expected mean squares:

^ ff2

W = MW ;

^ ff2

B = (MB ` MW )=n:

(But if MB < MW , ^ ff2

B = 0)

slide-37
SLIDE 37

(32) Intraclass correlation

The intraclass correlation is ff2

B=(ff2 B + ff2 W ).

It is the correlation between two observations from the same group resulting from the shared random component (U). In genetic studies, the group is a family of sibs, and the intraclass correlation is proportional to the heritability of the trait. Sib relationship Correlation Identical twins h2 Full sibs 0:5h2 Half sibs 0:25h2

slide-38
SLIDE 38

(33) Finger ridges in identical twins

Family 1 71 71 2 79 82 3 105 99 4 115 114 5 76 70 6 83 82 7 114 113 8 57 44 9 114 113 10 94 91 11 75 83 12 76 72

slide-39
SLIDE 39

(34) R code (˛nger ridges)

Calculation of sums of squares is done in the same way as for the sheep data, except now we have k = 12 groups, each of size n = 2. Here is the R code:

library(sda) # if required fit <- aov(ridges ˜ family, data = fingers) summary(fit)

The ’˛ngers’ data frame has 24 rows and two variables, a numeric variable ’ridges’ and a factor ’family’ with 12 levels.

slide-40
SLIDE 40

(35) anova table (˛nger ridges)

Source DF Sum Sq Mean Sq F Between families 11 8990.5 817.31 57.19 Within families 12 171.5 14.29 Total 23 9162.0 F test is very highly signi˛cant P < 0.001. MeanSq E(MeanSq) Between families 817.31 ff2

W + 2ff2 B

Within families 14.29 ff2

W

Here, ^ ff2

W = 14:29, ^

ff2

B = 401:51.

slide-41
SLIDE 41

(36) Fixed and random e¸ects

In the ‘sheep’ example, di¸erences among breeds are regarded as ˛xed e¸ects. Examples of factors usually treated as ˛xed are treatments, sex, breed. In the ‘twins’ example, di¸erences between particular families are not of interest. Family means as assumed to be sampled from an N(m; ff2

B) distn.

Examples of random e¸ects are families, litters, or sires, where the families (for example) are a sample of those possible. A repeat experiment would use di¸erent families.

slide-42
SLIDE 42

(37) A mixed model

The model may include ˛xed and random e¸ects. A sample of 7 lambs from each of 6 farms were measured for plasma Cu. Three farms (A, B, C) are located on loam soils and three (D, E, F) on peaty soils. Does soil type a¸ect the Cu measurement?

slide-43
SLIDE 43

(38) A mixed model

If we treat farms as ˛xed e¸ects, then the anova is a standard one with k = 6 farms and constant sample size n = 7. The question about soil type is answered by testing the contrast between farms A, B, C and farms D, E, F. In this case our conclusion will apply narrowly to these speci˛c farms.

slide-44
SLIDE 44

(39) A mixed model

If we treat farm e¸ects as random, we have a mixed model with one ˛xed e¸ect (soil type) and one random e¸ect (farm). Farms A, B, C are regarded as a random sample of farms on loamy soil, farms D, E, F as a random sample of farms on peaty soil. Our conclusion about the e¸ect of soil type will apply widely to these two populations of farms. The analysis is equivalent to a two-stage approach in which we calculate a mean for each of the six farms, then do a standard anova on these six measurements with soil as the grouping factor (k = 2, n = 3).

slide-45
SLIDE 45

(40) Analysing the farms data

library(sda) # if required fit <- aov(Cu ˜ soil + Error(farm), data = farms) summary(fit)

The Error function ensures that farm is treated as a random e¸ect. The anova table is split into two ’strata’ (between and within farm). The soil type comparison is part of the between-farm stratum. Here there are no ˛xed e¸ect comparisons in the within-farm stratum.

slide-46
SLIDE 46

(41) Unbalanced data

Using ANOVA to ˛t a mixed model is easy when data are balanced, as in the ’farms’ example. An alternative is to estimate variance components by maximum likelihood. The standard R function for this is lme (part of the nlme package). Maximum likelihood will be covered in week 8 (although we don’t use it to estimate variance components).

slide-47
SLIDE 47

(42) Using lme

Here is R code to analyse the ˛nger ridge data with the lme function.

library(sda) # if required library(nlme) fit <- lme(ridges ˜ 1, random = ˜ 1|family, data = fingers) summary(fit)

(If you are feeling con˛dent you could also try using lme to analyse the farms data.)

slide-48
SLIDE 48

(43) Output from lme

Linear mixed-e¸ects model ˛t by REML Random e¸ects: Formula: ‰ 1j family (Intercept) Residual StdDev: 20.03774 3.780433 Fixed e¸ects: ridges ‰ 1 Value Std.Error (Intercept) 87.20833 5.835644 Number of Observations: 24 Number of Groups: 12