

SLIDE 1

Gov 2000: 10. Multiple Regression in Matrix Form

Matthew Blackwell

Fall 2016

SLIDE 2
  • 1. Matrix algebra review
  • 2. Matrix Operations
  • 3. Linear model in matrix form
  • 4. OLS in matrix form
  • 5. OLS inference in matrix form

SLIDE 3

Where are we? Where are we going?

  • Last few weeks: regression estimation and inference with one and two independent variables, varying effects
  • This week: the general regression model with arbitrary covariates

  • Next week: what happens when assumptions are wrong

SLIDE 4

Nunn & Wantchekon

  • Are there long-term, persistent effects of the slave trade on Africans today?
  • Basic idea: compare levels of interpersonal trust ($Y_i$) across different levels of historical slave exports for a respondent's ethnic group
  • Problem: ethnic groups and respondents might differ in their interpersonal trust in ways that correlate with the severity of slave exports
  • One solution: try to control for relevant differences between groups via multiple regression

SLIDE 5

Nunn & Wantchekon

  • Whaaaaa? Bold letters, quotation marks, what is this?
  • Today's goal is to decipher this type of writing

SLIDE 6

Multiple Regression in R

nunn <- foreign::read.dta("../data/Nunn_Wantchekon_AER_2011.dta")
mod <- lm(trust_neighbors ~ exports + age + male + urban_dum +
            malaria_ecology, data = nunn)
summary(mod)
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      1.5030370  0.0218325   68.84   <2e-16 ***
## exports         -0.0010208  0.0000409  -24.94   <2e-16 ***
## age              0.0050447  0.0004724   10.68   <2e-16 ***
## male             0.0278369  0.0138163    2.01    0.044 *
## urban_dum       -0.2738719  0.0143549  -19.08   <2e-16 ***
## malaria_ecology  0.0194106  0.0008712   22.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.978 on 20319 degrees of freedom
##   (1497 observations deleted due to missingness)
## Multiple R-squared: 0.0604, Adjusted R-squared: 0.0602
## F-statistic: 261 on 5 and 20319 DF, p-value: <2e-16

SLIDE 7

Why matrices and vectors?

SLIDE 9

Why matrices and vectors?

  • Here's one way to write the full multiple regression model:

$$y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik}\beta_k + u_i$$

  • Notation is going to get needlessly messy as we add variables.
  • Matrices are clean, but they are like a foreign language.
  • You need to build intuitions over a long period of time.

SLIDE 10

Quick note about interpretation

๐‘ง๐‘— = ๐›พ0 + ๐‘ฆ๐‘—1๐›พ1 + ๐‘ฆ๐‘—2๐›พ2 + โ‹ฏ + ๐‘ฆ๐‘—๐‘™๐›พ๐‘™ + ๐‘ฃ๐‘—

  • In this model, ๐›พ1 is the efgect of a one-unit change in ๐‘ฆ๐‘—1

conditional on all other ๐‘ฆ๐‘—๐‘˜.

  • Jargon โ€œpartial efgect,โ€ โ€œceteris paribus,โ€ โ€œall else equal,โ€

โ€œconditional on the covariates,โ€ etc

  • Notation change: lower-case letters here are random variables.

SLIDE 11

1/ Matrix algebra review

SLIDE 12

Vectors

  • A vector is just a list of numbers (or random variables).
  • A $1 \times k$ row vector has these numbers arranged in a row:

$$\mathbf{b} = \begin{bmatrix} b_1 & b_2 & b_3 & \cdots & b_k \end{bmatrix}$$

  • A $k \times 1$ column vector arranges the numbers in a column:

$$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix}$$

  • Convention: we'll assume that a vector is a column vector, and vectors will be written with lowercase bold lettering ($\mathbf{b}$)

SLIDE 13

Vector examples

  • Vector of all covariates for a particular unit $i$ (see the R sketch below):

$$\mathbf{x}_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{ik} \end{bmatrix}$$

  • For the Nunn-Wantchekon data, we might have:

$$\mathbf{x}_i = \begin{bmatrix} 1 \\ \text{exports}_i \\ \text{age}_i \\ \text{male}_i \end{bmatrix}$$

SLIDE 14

Matrices

  • A matrix is just a rectangular array of numbers.
  • We say that a matrix is $n \times k$ ("$n$ by $k$") if it has $n$ rows and $k$ columns.
  • Uppercase bold denotes a matrix:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}$$

  • Generic entry: $a_{ij}$, the entry in row $i$ and column $j$

SLIDE 15

Examples of matrices

  • One example of a matrix that we'll use a lot is the design matrix, which has a column of ones and then one column for each independent variable in the regression:

$$\mathbf{X} = \begin{bmatrix} 1 & \text{exports}_1 & \text{age}_1 & \text{male}_1 \\ 1 & \text{exports}_2 & \text{age}_2 & \text{male}_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{exports}_n & \text{age}_n & \text{male}_n \end{bmatrix}$$

SLIDE 16

Design matrix in R

head(model.matrix(mod), 8)
##   (Intercept) exports age male urban_dum malaria_ecology
## 1           1     855  40    0         0           28.15
## 2           1     855  25    1         0           28.15
## 3           1     855  38    1         1           28.15
## 4           1     855  37    1         0           28.15
## 5           1     855  31    1         0           28.15
## 6           1     855  45    0         0           28.15
## 7           1     855  20    1         0           28.15
## 8           1     855  31    0         0           28.15

dim(model.matrix(mod))
## [1] 20325 6

SLIDE 17

2/ Matrix Operations

SLIDE 18

Transpose

  • The transpose of a matrix $\mathbf{A}$ is the matrix created by switching the rows and columns of $\mathbf{A}$; it is denoted $\mathbf{A}'$.
  • The $k$th column of $\mathbf{A}$ becomes the $k$th row of $\mathbf{A}'$:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \qquad \mathbf{A}' = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \end{bmatrix}$$

  • If $\mathbf{A}$ is $n \times k$, then $\mathbf{A}'$ will be $k \times n$.
  • Also written $\mathbf{A}^{\mathsf{T}}$

SLIDE 19

Transposing vectors

  • Transposing will turn a $k \times 1$ column vector into a $1 \times k$ row vector and vice versa:

$$\mathbf{x}_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{ik} \end{bmatrix} \qquad \mathbf{x}_i' = \begin{bmatrix} 1 & x_{i1} & x_{i2} & \cdots & x_{ik} \end{bmatrix}$$

SLIDE 20

Transposing in R

a <- matrix(1:6, ncol = 3, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

t(a)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

SLIDE 21

Write matrices as vectors

  • A matrix is just a collection of vectors (row or column)
  • As row vectors:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{bmatrix} \quad \text{with row vectors} \quad \mathbf{a}_1' = \begin{bmatrix} a_{11} & a_{12} & a_{13} \end{bmatrix} \quad \mathbf{a}_2' = \begin{bmatrix} a_{21} & a_{22} & a_{23} \end{bmatrix}$$

  • Or we can define it in terms of column vectors:

$$\mathbf{B} = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 \end{bmatrix}$$

where $\mathbf{b}_1$ and $\mathbf{b}_2$ represent the columns of $\mathbf{B}$.

  • $j$ subscripts the columns of a matrix: $\mathbf{x}_j$
  • $i$ and $t$ will be used for rows: $\mathbf{x}_i'$

SLIDE 22

Design matrix

  • Design matrix as a series of row vectors:

$$\mathbf{X} = \begin{bmatrix} 1 & \text{exports}_1 & \text{age}_1 & \text{male}_1 \\ 1 & \text{exports}_2 & \text{age}_2 & \text{male}_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{exports}_n & \text{age}_n & \text{male}_n \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}$$

  • Design matrix as a series of column vectors:

$$\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k \end{bmatrix}$$

SLIDE 23

Addition and subtraction

  • How do we add or subtract matrices and vectors?
  • First, the matrices/vectors need to be conformable, meaning that the dimensions have to be the same.
  • Let $\mathbf{A}$ and $\mathbf{B}$ both be $2 \times 2$ matrices. Then $\mathbf{C} = \mathbf{A} + \mathbf{B}$, where we add each cell together (see the R sketch below):

$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \mathbf{C}$$

SLIDE 24

Scalar multiplication

  • A scalar is just a single number: you can think of it sort of like a $1 \times 1$ matrix.
  • When we multiply a matrix by a scalar, we just multiply each element/cell by that scalar (sketched in R below):

$$b\mathbf{A} = b \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} b \times a_{11} & b \times a_{12} \\ b \times a_{21} & b \times a_{22} \end{bmatrix}$$

SLIDE 25

3/ Linear model in matrix form

SLIDE 26

The linear model with new notation

  • Remember that we wrote the linear model as the following for all $i \in \{1, \ldots, n\}$:

$$y_i = \beta_0 + x_i\beta_1 + z_i\beta_2 + u_i$$

  • Imagine we had an $n$ of 4. We could write out each formula:

$$y_1 = \beta_0 + x_1\beta_1 + z_1\beta_2 + u_1 \quad \text{(unit 1)}$$
$$y_2 = \beta_0 + x_2\beta_1 + z_2\beta_2 + u_2 \quad \text{(unit 2)}$$
$$y_3 = \beta_0 + x_3\beta_1 + z_3\beta_2 + u_3 \quad \text{(unit 3)}$$
$$y_4 = \beta_0 + x_4\beta_1 + z_4\beta_2 + u_4 \quad \text{(unit 4)}$$

SLIDE 27

The linear model with new notation

๐‘ง1 = ๐›พ0 + ๐‘ฆ1๐›พ1 + ๐‘จ1๐›พ2 + ๐‘ฃ1 (unit 1) ๐‘ง2 = ๐›พ0 + ๐‘ฆ2๐›พ1 + ๐‘จ2๐›พ2 + ๐‘ฃ2 (unit 2) ๐‘ง3 = ๐›พ0 + ๐‘ฆ3๐›พ1 + ๐‘จ3๐›พ2 + ๐‘ฃ3 (unit 3) ๐‘ง4 = ๐›พ0 + ๐‘ฆ4๐›พ1 + ๐‘จ4๐›พ2 + ๐‘ฃ4 (unit 4)

  • We can write this as:

โŽก โŽข โŽข โŽข โŽฃ ๐‘ง1 ๐‘ง2 ๐‘ง3 ๐‘ง4 โŽค โŽฅ โŽฅ โŽฅ โŽฆ = โŽก โŽข โŽข โŽข โŽฃ 1 1 1 1 โŽค โŽฅ โŽฅ โŽฅ โŽฆ ๐›พ0 + โŽก โŽข โŽข โŽข โŽฃ ๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ3 ๐‘ฆ4 โŽค โŽฅ โŽฅ โŽฅ โŽฆ ๐›พ1 + โŽก โŽข โŽข โŽข โŽฃ ๐‘จ1 ๐‘จ2 ๐‘จ3 ๐‘จ4 โŽค โŽฅ โŽฅ โŽฅ โŽฆ ๐›พ2 + โŽก โŽข โŽข โŽข โŽฃ ๐‘ฃ1 ๐‘ฃ2 ๐‘ฃ3 ๐‘ฃ4 โŽค โŽฅ โŽฅ โŽฅ โŽฆ

  • Outcome is a linear combination of the the ๐ฒ, ๐ด, and ๐ฏ vectors

SLIDE 28

Grouping things into matrices

  • Can we write this in a more compact form? Yes! Let $\mathbf{X}$ and $\boldsymbol{\beta}$ be the following:

$$\underset{(4 \times 3)}{\mathbf{X}} = \begin{bmatrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ 1 & x_3 & z_3 \\ 1 & x_4 & z_4 \end{bmatrix} \qquad \underset{(3 \times 1)}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$

SLIDE 29

Matrix multiplication by a vector

  • We can write this more compactly as a matrix (post-)multiplied by a vector:

$$\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \beta_0 + \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \beta_1 + \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} \beta_2 = \mathbf{X}\boldsymbol{\beta}$$

  • Multiplication of a matrix by a vector is just the linear combination of the columns of the matrix, with the vector elements as the weights/coefficients.
  • And the left-hand side here only uses scalars times vectors, which is easy!

SLIDE 30

General matrix by vector multiplication

  • $\mathbf{A}$ is an $n \times k$ matrix
  • $\mathbf{b}$ is a $k \times 1$ column vector
  • Columns of $\mathbf{A}$ have to match rows of $\mathbf{b}$
  • Let $\mathbf{a}_j$ be the $j$th column of $\mathbf{A}$. Then we can write:

$$\underset{(n \times 1)}{\mathbf{c}} = \mathbf{A}\mathbf{b} = b_1\mathbf{a}_1 + b_2\mathbf{a}_2 + \cdots + b_k\mathbf{a}_k$$

  • $\mathbf{c}$ is a linear combination of the columns of $\mathbf{A}$ (see the R sketch below)
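A numerical sketch in R with toy numbers: the matrix-by-vector product really is a weighted sum of the columns.

A <- matrix(1:6, nrow = 3, ncol = 2)  # 3 x 2 matrix
b <- c(2, -1)                         # 2 x 1 vector
A %*% b                               # matrix-by-vector product
b[1] * A[, 1] + b[2] * A[, 2]         # same: linear combination of columns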

SLIDE 31

Back to regression

  • $\mathbf{X}$ is the $n \times (k+1)$ design matrix of independent variables
  • $\boldsymbol{\beta}$ is the $(k+1) \times 1$ column vector of coefficients
  • $\mathbf{X}\boldsymbol{\beta}$ will be $n \times 1$:

$$\mathbf{X}\boldsymbol{\beta} = \beta_0\mathbf{1} + \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \cdots + \beta_k\mathbf{x}_k$$

  • Thus, we can compactly write the linear model as the following:

$$\underset{(n \times 1)}{\mathbf{y}} = \underset{(n \times 1)}{\mathbf{X}\boldsymbol{\beta}} + \underset{(n \times 1)}{\mathbf{u}}$$

SLIDE 32

Inner product

  • The inner (or dot) product of two column vectors $\mathbf{a}$ and $\mathbf{b}$ (of equal dimension, $k \times 1$):

$$\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}'\mathbf{b} = a_1b_1 + a_2b_2 + \cdots + a_kb_k$$

  • If $\mathbf{a}'\mathbf{b} = 0$, we say that the two vectors are orthogonal (see the R sketch below).
  • With $\mathbf{c} = \mathbf{A}\mathbf{b}$, we can write the entries of $\mathbf{c}$ as inner products: $c_i = \mathbf{a}_i'\mathbf{b}$, where $\mathbf{a}_i'$ is the $i$th row of $\mathbf{A}$.
  • If $\mathbf{x}_i'$ is the $i$th row of $\mathbf{X}$, then we can write the linear model as:

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik}\beta_k + u_i$$
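In R, a toy sketch: the inner product is t(a) %*% b or, more simply, sum(a * b).

a <- c(1, 0, 2)
b <- c(3, 5, -1)
t(a) %*% b    # inner product, returned as a 1 x 1 matrix
sum(a * b)    # the same number as a plain scalar: 1*3 + 0*5 + 2*(-1) = 1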

SLIDE 33

4/ OLS in matrix form

SLIDE 34

Matrix multiplication

  • What if, instead of a column vector $\mathbf{b}$, we have a matrix $\mathbf{B}$ with dimensions $k \times m$?
  • How do we do multiplication like $\mathbf{C} = \mathbf{A}\mathbf{B}$?
  • Each column of the new matrix is just matrix-by-vector multiplication:

$$\mathbf{C} = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_m \end{bmatrix} \qquad \mathbf{c}_j = \mathbf{A}\mathbf{b}_j$$

  • Thus, each column of $\mathbf{C}$ is a linear combination of the columns of $\mathbf{A}$.

SLIDE 35

Properties of matrix multiplication

  • Matrix multiplication is not commutative: in general, $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$
  • It is associative and distributive:

$$\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C} \qquad \mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$$

  • The transpose of a product: $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$ (checked numerically below)
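Both properties are easy to check in R, a sketch with toy 2 x 2 matrices:

A <- matrix(c(1, 2, 0, 1), nrow = 2)
B <- matrix(c(0, 1, 1, 0), nrow = 2)
identical(A %*% B, B %*% A)            # FALSE: order matters
all.equal(t(A %*% B), t(B) %*% t(A))   # TRUE: the transpose rule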

SLIDE 36

Square matrices and the diagonal

  • A square matrix has equal numbers of rows and columns.
  • The identity matrix, $\mathbf{I}_k$, is a $k \times k$ square matrix with 1s along the diagonal and 0s everywhere else:

$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

  • The $k \times k$ identity matrix multiplied by any $m \times k$ matrix returns that matrix: $\mathbf{A}\mathbf{I}_k = \mathbf{A}$

SLIDE 37

Identity matrix

  • To get the diagonal of a matrix in R, use the diag() function:

b <- matrix(1:4, nrow = 2, ncol = 2)
b
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
diag(b)
## [1] 1 4

  • diag() also creates identity matrices in R:

diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

SLIDE 38

Multiple linear regression in matrix form

  • Let $\hat{\boldsymbol{\beta}}$ be the vector of estimated regression coefficients and $\hat{\mathbf{y}}$ be the vector of fitted values (computed in the R sketch below):

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_k \end{bmatrix} \qquad \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$

  • It might be helpful to see this again more written out:

$$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \mathbf{X}\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1 \cdot \hat{\beta}_0 + x_{11}\hat{\beta}_1 + x_{12}\hat{\beta}_2 + \cdots + x_{1k}\hat{\beta}_k \\ 1 \cdot \hat{\beta}_0 + x_{21}\hat{\beta}_1 + x_{22}\hat{\beta}_2 + \cdots + x_{2k}\hat{\beta}_k \\ \vdots \\ 1 \cdot \hat{\beta}_0 + x_{n1}\hat{\beta}_1 + x_{n2}\hat{\beta}_2 + \cdots + x_{nk}\hat{\beta}_k \end{bmatrix}$$

SLIDE 39

Residuals

  • We can easily write the residuals in matrix form:

$$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$

  • The norm or length of a vector generalizes Euclidean distance and is just the square root of the sum of the squared entries:

$$\lVert \mathbf{a} \rVert = \sqrt{a_1^2 + a_2^2 + \cdots + a_k^2}$$

  • We can write the norm in terms of the inner product: $\lVert \mathbf{a} \rVert^2 = \mathbf{a}'\mathbf{a}$
  • Thus we can compactly write the sum of the squared residuals as (see the R sketch below):

$$\lVert \hat{\mathbf{u}} \rVert^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}} = \sum_{i=1}^{n} \hat{u}_i^2$$

SLIDE 40

OLS estimator in matrix form

  • OLS still minimizes the sum of the squared residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b} \in \mathbb{R}^{k+1}} \lVert \mathbf{y} - \mathbf{X}\mathbf{b} \rVert^2$$

  • Take (matrix) derivatives, set equal to 0
  • Resulting first-order conditions:

$$\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0$$

  • Rearranging:

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$$

  • In order to isolate $\hat{\boldsymbol{\beta}}$, we need to move the $\mathbf{X}'\mathbf{X}$ term to the other side of the equals sign.
  • We've learned about matrix multiplication, but what about matrix "division"?

SLIDE 41

Scalar inverses

  • What is division in its simplest form? $\frac{1}{a}$ is the value such that $a \cdot \frac{1}{a} = 1$.
  • For some algebraic expression $au = b$, let's solve for $u$:

$$\frac{1}{a} \cdot au = \frac{1}{a} \cdot b \quad \Longrightarrow \quad u = \frac{b}{a}$$

  • We need a matrix version of $\frac{1}{a}$.

SLIDE 42

Matrix inverses

  • Definition: If it exists, the inverse of a square matrix $\mathbf{A}$, denoted $\mathbf{A}^{-1}$, is the matrix such that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$.
  • We can use the inverse to solve (systems of) equations, as in the R sketch below:

$$\mathbf{A}\mathbf{u} = \mathbf{b} \;\Longrightarrow\; \mathbf{A}^{-1}\mathbf{A}\mathbf{u} = \mathbf{A}^{-1}\mathbf{b} \;\Longrightarrow\; \mathbf{I}\mathbf{u} = \mathbf{A}^{-1}\mathbf{b} \;\Longrightarrow\; \mathbf{u} = \mathbf{A}^{-1}\mathbf{b}$$

  • If the inverse exists, we say that $\mathbf{A}$ is invertible or nonsingular.
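In R, a toy system of two equations: solve(A) computes the inverse, and solve(A, b) solves Au = b directly.

A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(5, 10)
solve(A) %*% b  # u = A^{-1} b: the solution (1, 3)
solve(A, b)     # same answer, computed more stably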

SLIDE 43

Back to OLS

  • Let's assume, for now, that the inverse of $\mathbf{X}'\mathbf{X}$ exists (we'll come back to this).
  • Then we can write the OLS estimator as the following:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

  • Memorize this: "ex prime ex inverse ex prime y." Sear it into your soul.

SLIDE 44

Understanding check

  • Suppose $\mathbf{y}$ is $n \times 1$ and $\mathbf{X}$ is $n \times (k+1)$.
  • What are the dimensions of $\mathbf{X}'\mathbf{X}$?
  • True/False: $\mathbf{X}'\mathbf{X}$ is symmetric.

▶ Note: A square matrix is symmetric if $\mathbf{A} = \mathbf{A}'$.

  • What are the dimensions of $(\mathbf{X}'\mathbf{X})^{-1}$?
  • What are the dimensions of $\mathbf{X}'\mathbf{y}$?
  • What are the dimensions of $\hat{\boldsymbol{\beta}}$?

SLIDE 45

Implications of OLS

  • We can generalize some mechanical results about OLS (verified numerically in the sketch below).
  • The independent variables are orthogonal to the residuals:

$$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0$$

  • The fitted values are orthogonal to the residuals:

$$\hat{\mathbf{y}}'\hat{\mathbf{u}} = (\mathbf{X}\hat{\boldsymbol{\beta}})'\hat{\mathbf{u}} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\hat{\mathbf{u}} = 0$$

SLIDE 46

OLS by hand in R

ฬ‚ ๐œธ = (๐˜โ€ฒ๐˜)โˆ’1๐˜โ€ฒ๐ณ

  • First we need to get the design matrix and the response:

X <- model.matrix(trust_neighbors ~ exports + age + male + urban_dum + malaria_ecology, data = nunn) dim(X) ## [1] 20325 6 ## model.frame always puts the response in the first column y <- model.frame(trust_neighbors ~ exports + age + male + urban_dum + malaria_ecology, data = nunn)[,1] length(y) ## [1] 20325

SLIDE 47

OLS by hand in R

ฬ‚ ๐œธ = (๐˜โ€ฒ๐˜)โˆ’1๐˜โ€ฒ๐ณ

  • Use the solve() for inverses and %*% for matrix

multiplication:

solve(t(X) %*% X) %*% t(X) %*% y ## (Intercept) exports age male urban_dum ## [1,] 1.503 -0.001021 0.005045 0.02784

  • 0.2739

## malaria_ecology ## [1,] 0.01941 coef(mod) ## (Intercept) exports age male ## 1.503037

  • 0.001021

0.005045 0.027837 ## urban_dum malaria_ecology ##

  • 0.273872

0.019411

SLIDE 48

Intuition for the OLS in matrix form

ฬ‚ ๐œธ = (๐˜โ€ฒ๐˜)โˆ’1๐˜โ€ฒ๐ณ

  • Whatโ€™s the intuition here?
  • โ€œNumeratorโ€ ๐˜โ€ฒ๐ณ: is roughly composed of the covariances

between the columns of ๐˜ and ๐ณ

  • โ€œDenominatorโ€ ๐˜โ€ฒ๐˜ is roughly composed of the sample

variances and covariances of variables within ๐˜

  • Thus, we have something like:

ฬ‚ ๐œธ โ‰ˆ (variance of ๐˜)โˆ’1(covariance of ๐˜ & ๐ณ)

  • This is a rough sketch and isnโ€™t strictly true, but it can

provide intuition.

SLIDE 49

5/ OLS inference in matrix form

SLIDE 50

Random vectors

  • A random vector is a vector of random variables:

$$\mathbf{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix}$$

  • Here, $\mathbf{x}_i$ is a random vector and $x_{i1}$ and $x_{i2}$ are random variables.
  • When we talk about the distribution of $\mathbf{x}_i$, we are talking about the joint distribution of $x_{i1}$ and $x_{i2}$.

SLIDE 51

Distribution of random vectors

  • Expectation of random vectors:

$$\mathbb{E}[\mathbf{x}_i] = \begin{bmatrix} \mathbb{E}[x_{i1}] \\ \mathbb{E}[x_{i2}] \end{bmatrix}$$

  • Variance of random vectors:

$$\mathbb{V}[\mathbf{x}_i] = \begin{bmatrix} \mathbb{V}[x_{i1}] & \text{Cov}[x_{i1}, x_{i2}] \\ \text{Cov}[x_{i1}, x_{i2}] & \mathbb{V}[x_{i2}] \end{bmatrix}$$

  • Properties of this variance-covariance matrix (see the R sketch below):

▶ if $\mathbf{a}$ is constant, then $\mathbb{V}[\mathbf{a}'\mathbf{x}_i] = \mathbf{a}'\mathbb{V}[\mathbf{x}_i]\mathbf{a}$
▶ if the matrix $\mathbf{A}$ and vector $\mathbf{b}$ are constant, then $\mathbb{V}[\mathbf{A}\mathbf{x}_i + \mathbf{b}] = \mathbf{A}\mathbb{V}[\mathbf{x}_i]\mathbf{A}'$
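The sample analogue of the first property, sketched in R with simulated data:

set.seed(1)
x <- matrix(rnorm(2000), ncol = 2)  # 1,000 draws of a 2-vector
a <- c(1, 2)
var(x %*% a)                        # sample variance of a'x
t(a) %*% var(x) %*% a               # a' V-hat[x] a: identical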

SLIDE 52

Most general OLS assumptions

  • 1. Linearity: $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$
  • 2. Random/iid sample: $(y_i, \mathbf{x}_i')$ are an iid sample from the population.
  • 3. No perfect collinearity: $\mathbf{X}$ is an $n \times (k+1)$ matrix with rank $k+1$.
  • 4. Zero conditional mean: $\mathbb{E}[u_i \mid \mathbf{x}_i] = 0$
  • 5. Homoskedasticity: $\mathbb{V}[u_i \mid \mathbf{x}_i] = \sigma_u^2$
  • 6. Normality: $u_i \mid \mathbf{x}_i \sim N(0, \sigma_u^2)$

SLIDE 53

Matrix rank

  • Definition: The rank of a matrix is the maximum number of linearly independent columns.
  • Definition: The columns of a matrix $\mathbf{X}$ are linearly independent if $\mathbf{X}\mathbf{b} = 0$ if and only if $\mathbf{b} = 0$, where

$$\mathbf{X}\mathbf{b} = b_1\mathbf{x}_1 + b_2\mathbf{x}_2 + \cdots + b_k\mathbf{x}_k$$

  • Example violation: one column is a linear function of the others (see the R sketch below).

▶ 3 covariates with $\mathbf{x}_1 = \mathbf{x}_2 + \mathbf{x}_3$:

$$0 = b_1\mathbf{x}_1 + b_2\mathbf{x}_2 + b_3\mathbf{x}_3 = b_1(\mathbf{x}_2 + \mathbf{x}_3) + b_2\mathbf{x}_2 + b_3\mathbf{x}_3 = (b_1 + b_2)\mathbf{x}_2 + (b_1 + b_3)\mathbf{x}_3$$

  • …which equals 0 whenever $b_1 = -b_2 = -b_3$ ⇒ not linearly independent!
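A quick R sketch, with toy data and a deliberately collinear column: the rank comes out one short of the number of columns.

set.seed(1)
x2 <- rnorm(10)
x3 <- rnorm(10)
X_toy <- cbind(1, x2 + x3, x2, x3)  # second column = sum of the last two
qr(X_toy)$rank                      # 3 < 4 columns, so not full rank
## [1] 3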

SLIDE 54

Rank and matrix inversion

  • If ๐˜ is ๐‘œ ร— (๐‘™ + 1) has rank ๐‘™ + 1, then all of its columns are

linearly independent

โ–ถ Generalization of no perfect collinearity to arbitrary ๐‘™.

  • ๐˜ has rank ๐‘™ + 1 โ‡ (๐˜โ€ฒ๐˜) has rank ๐‘™ + 1
  • If a square (๐‘™ + 1) ร— (๐‘™ + 1) matrix has rank ๐‘™ + 1, then it is

invertible.

  • ๐˜ has rank ๐‘™ + 1 โ‡ (๐˜โ€ฒ๐˜)โˆ’1 exists and is unique.

SLIDE 55

Zero conditional mean error

  • Combining the zero conditional mean error and iid assumptions, we have:

$$\mathbb{E}[u_i \mid \mathbf{X}] = \mathbb{E}[u_i \mid \mathbf{x}_i] = 0$$

  • Stacking these into the vector of errors:

$$\mathbb{E}[\mathbf{u} \mid \mathbf{X}] = \begin{bmatrix} \mathbb{E}[u_1 \mid \mathbf{X}] \\ \mathbb{E}[u_2 \mid \mathbf{X}] \\ \vdots \\ \mathbb{E}[u_n \mid \mathbf{X}] \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \mathbf{0}$$

SLIDE 56

Expectation of OLS

  • It is useful to write OLS as:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$$

  • Under assumptions 1-4, OLS is conditionally unbiased for $\boldsymbol{\beta}$:

$$\mathbb{E}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbb{E}[\mathbf{u} \mid \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{0} = \boldsymbol{\beta}$$

  • This implies that OLS is unconditionally unbiased as well: $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ (simulated in the sketch below)

SLIDE 57

Variance of OLS

  • What about $\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}]$?
  • Using some facts about variances and matrices, we can derive:

$$\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbb{V}[\mathbf{u} \mid \mathbf{X}]\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$$

  • What is the covariance matrix of the errors, $\mathbb{V}[\mathbf{u} \mid \mathbf{X}]$?

$$\mathbb{V}[\mathbf{u} \mid \mathbf{X}] = \begin{bmatrix} \mathbb{V}[u_1 \mid \mathbf{X}] & \text{cov}[u_1, u_2 \mid \mathbf{X}] & \cdots & \text{cov}[u_1, u_n \mid \mathbf{X}] \\ \text{cov}[u_2, u_1 \mid \mathbf{X}] & \mathbb{V}[u_2 \mid \mathbf{X}] & \cdots & \text{cov}[u_2, u_n \mid \mathbf{X}] \\ \vdots & & \ddots & \vdots \\ \text{cov}[u_n, u_1 \mid \mathbf{X}] & \text{cov}[u_n, u_2 \mid \mathbf{X}] & \cdots & \mathbb{V}[u_n \mid \mathbf{X}] \end{bmatrix}$$

  • This matrix is symmetric since $\text{cov}[u_i, u_j \mid \mathbf{X}] = \text{cov}[u_j, u_i \mid \mathbf{X}]$.

SLIDE 58

Homoskedasticity

  • By homoskedasticity and iid, for any units $i$, $s$, $t$:

▶ $\mathbb{V}[u_i \mid \mathbf{X}] = \mathbb{V}[u_i \mid \mathbf{x}_i] = \sigma_u^2$ (constant variance)
▶ $\text{cov}[u_s, u_t \mid \mathbf{X}] = 0$ for $s \neq t$ (uncorrelated errors)

  • Then the covariance matrix of the errors is simply:

$$\mathbb{V}[\mathbf{u} \mid \mathbf{X}] = \sigma_u^2\mathbf{I}_n = \begin{bmatrix} \sigma_u^2 & 0 & \cdots & 0 \\ 0 & \sigma_u^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_u^2 \end{bmatrix}$$

  • Thus, we have the following:

$$\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma_u^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1}$$

SLIDE 59

Sampling variance for OLS estimates

  • Under assumptions 1-5, the sampling variance of the OLS estimator can be written in matrix form as the following:

$$\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1}$$

  • This symmetric matrix looks like this:

$$\begin{bmatrix} \mathbb{V}[\hat{\beta}_0 \mid \mathbf{X}] & \text{Cov}[\hat{\beta}_0, \hat{\beta}_1 \mid \mathbf{X}] & \cdots & \text{Cov}[\hat{\beta}_0, \hat{\beta}_k \mid \mathbf{X}] \\ \text{Cov}[\hat{\beta}_0, \hat{\beta}_1 \mid \mathbf{X}] & \mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}] & \cdots & \text{Cov}[\hat{\beta}_1, \hat{\beta}_k \mid \mathbf{X}] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[\hat{\beta}_0, \hat{\beta}_k \mid \mathbf{X}] & \text{Cov}[\hat{\beta}_1, \hat{\beta}_k \mid \mathbf{X}] & \cdots & \mathbb{V}[\hat{\beta}_k \mid \mathbf{X}] \end{bmatrix}$$

SLIDE 60

Inference in the general setting

  • Under assumptions 1-5, in large samples:

$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\text{se}}[\hat{\beta}_j]} \sim N(0, 1)$$

  • In small samples, under assumptions 1-6:

$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\text{se}}[\hat{\beta}_j]} \sim t_{n-(k+1)}$$

  • Thus, under the null of $H_0: \beta_j = 0$, we know that

$$\frac{\hat{\beta}_j}{\widehat{\text{se}}[\hat{\beta}_j]} \sim t_{n-(k+1)}$$

  • Here, the estimated SEs come from (computed by hand in the R sketch below):

$$\widehat{\mathbb{V}}[\hat{\boldsymbol{\beta}}] = \hat{\sigma}_u^2(\mathbf{X}'\mathbf{X})^{-1} \qquad \hat{\sigma}_u^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n - (k + 1)}$$

SLIDE 61

Covariance matrix in R

  • We can access this estimated covariance matrix, $\hat{\sigma}_u^2(\mathbf{X}'\mathbf{X})^{-1}$, in R:

vcov(mod)
##                   (Intercept)    exports        age       male
## (Intercept)      0.0004766593  1.164e-07 -7.956e-06 -6.676e-05
## exports          0.0000001164  1.676e-09 -3.659e-10  7.283e-09
## age             -0.0000079562 -3.659e-10  2.231e-07 -7.765e-07
## male            -0.0000667572  7.283e-09 -7.765e-07  1.909e-04
## urban_dum       -0.0000965843 -4.861e-08  7.108e-07 -1.711e-06
## malaria_ecology -0.0000069094 -2.124e-08  2.324e-10 -1.017e-07
##                  urban_dum malaria_ecology
## (Intercept)     -9.658e-05      -6.909e-06
## exports         -4.861e-08      -2.124e-08
## age              7.108e-07       2.324e-10
## male            -1.711e-06      -1.017e-07
## urban_dum        2.061e-04       2.724e-09
## malaria_ecology  2.724e-09       7.590e-07

SLIDE 62

Standard errors from the covariance matrix

  • Note that the diagonal entries are the variances, so the square root of the diagonal gives the standard errors:

sqrt(diag(vcov(mod)))
##     (Intercept)         exports             age            male
##      0.02183253      0.00004094      0.00047237      0.01381627
##       urban_dum malaria_ecology
##      0.01435491      0.00087123

coef(summary(mod))[, "Std. Error"]
##     (Intercept)         exports             age            male
##      0.02183253      0.00004094      0.00047237      0.01381627
##       urban_dum malaria_ecology
##      0.01435491      0.00087123

SLIDE 63

Nunn & Wantchekon

SLIDE 64

Wrapping up

  • You have the full power of matrices.
  • Key to writing the OLS estimator and discussing higher-level concepts in regression and beyond.
  • Next week: diagnosing and fixing problems with the linear model.
