Multiple logistic regression. Richard Erickson, Instructor, DataCamp



SLIDE 1

DataCamp Generalized Linear Models in R

Multiple logistic regression

GENERALIZED LINEAR MODELS IN R

Richard Erickson

Instructor

SLIDE 2

Chapter overview

Multiple logistic regression
Formulas in R
Model assumptions

SLIDE 3

Why multiple regression?

Problem: Multiple predictor variables. Which one should I include?
Solution: Include all of them using multiple regression.

SLIDE 4

Multiple predictor variables

Simple linear models or simple GLM:
  Limited to 1 slope and 1 intercept
  $y \sim \beta_0 + \beta_1 x + \epsilon$
Multiple regression
  Multiple slopes and intercepts:
  $y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \epsilon$

SLIDE 5

Too much of a good thing

Theoretical maximum number of coefficients:
  Number of βs = number of samples
Over-fitting:
  Using too many predictors compared to the number of samples
Practical maximum number of coefficients:
  Number of βs × 10 ≈ number of samples
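The rule of thumb above can be turned into a quick check. This is a minimal sketch in R: the helper name max_coefficients and the sample size are hypothetical, and the factor of 10 is the slide's rule of thumb rather than a hard limit.

```r
# Hypothetical helper: practical upper bound on the number of coefficients,
# using the rule of thumb (number of betas x 10 ~ number of samples).
max_coefficients <- function(n_samples, samples_per_coef = 10) {
  floor(n_samples / samples_per_coef)
}

n <- 250                               # assumed sample size, illustration only
practical_max <- max_coefficients(n)   # rule-of-thumb bound
theoretical_max <- n                   # one coefficient per sample
practical_max
```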

SLIDE 6

Bus data: Two possible predictors

With bus commuter data, 2 possible predictors
  Number of days one commutes: CommuteDays
  Distance of commute: MilesOneWay
Possible to build a model with both

glm(Bus ~ CommuteDays + MilesOneWay, data = bus, family = 'binomial')
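The course's bus data is not included in this transcript, so the sketch below fits the same kind of model to simulated data. The data frame bus_sim, its values, and the coefficients used to generate it are all assumptions made for illustration; only the formula and family come from the slide.

```r
# Simulated stand-in for the course's bus data (all values are made up).
set.seed(1)
n <- 500
bus_sim <- data.frame(
  CommuteDays = sample(1:5, n, replace = TRUE),
  MilesOneWay = runif(n, min = 0, max = 40)
)

# Assumed data-generating model: longer commutes lower the chance of
# riding the bus. These numbers are arbitrary.
p_bus <- plogis(-0.5 + 0.1 * bus_sim$CommuteDays - 0.06 * bus_sim$MilesOneWay)
bus_sim$Bus <- factor(ifelse(rbinom(n, 1, p_bus) == 1, "Yes", "No"))

# Multiple logistic regression: one intercept and one slope per predictor.
fit <- glm(Bus ~ CommuteDays + MilesOneWay, data = bus_sim, family = "binomial")
coef(fit)
```

Because Bus is a factor with levels "No" and "Yes", glm() models the probability of the second level, "Yes".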

SLIDE 7

Summary of GLM with multiple predictors

Call:
glm(formula = Bus ~ CommuteDays + MilesOneWay, family = "binomial",
    data = bus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0732  -0.9035  -0.7816   1.3968   2.5066

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
CommuteDays  0.066084   0.023181   2.851  0.00436 **
MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
#...
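The estimates in this summary are on the logit scale. One common way to read them, sketched here as an aid rather than as part of the original slides, is to exponentiate each slope to get an odds ratio per one-unit increase in the predictor. The vector name estimates is hypothetical; the numbers are copied from the output above.

```r
# Slope estimates copied from the summary output above.
estimates <- c(CommuteDays = 0.066084, MilesOneWay = -0.059571)

# Exponentiating a logit-scale slope gives the multiplicative change in the
# odds of riding the bus for a one-unit increase in that predictor.
odds_ratios <- exp(estimates)
round(odds_ratios, 3)
```

An odds ratio above 1 (CommuteDays) means the odds rise with the predictor; one below 1 (MilesOneWay) means they fall.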

SLIDE 8

Correlation between predictors

SLIDE 9

Order of coefficients

No correlation between predictors
  Order not important
  $y \sim x_1 + x_2 + \epsilon \approx y \sim x_2 + x_1 + \epsilon$
Correlation between predictors
  Order may change estimates
  $y \sim x_1 + x_2 + \epsilon \neq y \sim x_2 + x_1 + \epsilon$
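Before worrying about order, it can help to check how strongly two predictors are correlated. A minimal sketch with simulated predictors follows; the variable names and the way the correlated predictor is built are assumptions for illustration.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2_independent <- rnorm(n)                  # generated without reference to x1
x2_correlated  <- x1 + rnorm(n, sd = 0.5)   # built from x1, so strongly correlated

# Pearson correlation between each candidate second predictor and x1.
cor_independent <- cor(x1, x2_independent)
cor_correlated  <- cor(x1, x2_correlated)
c(independent = cor_independent, correlated = cor_correlated)
```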

SLIDE 10

Let's practice!


SLIDE 11

Formulas in R


SLIDE 12

Why care about formulas for multiple logistic regression?

Formulas are the backbone of regression
They can be tricky to figure out
Understanding model.matrix() is key

SLIDE 13

Slopes

Estimates coefficient for continuous variable
  e.g., height = c(72.3, 21.1, 3.7, 1.0)
Formula also requires a global intercept
Multiple slopes: Slope for each predictor

SLIDE 14

Intercepts

Discrete groups used to predict
  factor or character in R: fish = c("red", "blue")
Single intercept has two options:
  Reference intercept + contrast: y ~ x
  Intercept for each group: y ~ x - 1

SLIDE 15

Multiple intercepts

Estimates effect of each group compared to reference group
  Reference group is alphabetically the first
Default has one reference group per variable

y ~ x1 + x2

Can specify that one variable has an intercept estimated for every group

y ~ x1 + x2 - 1

First variable has intercept estimated for each group
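The two formulas above can be compared by looking at the columns that model.matrix() builds for each. This is a minimal sketch; the factors x1 and x2 and their group labels are hypothetical.

```r
# Hypothetical grouping variables, for illustration only.
x1 <- factor(c("a", "a", "b", "b"))
x2 <- factor(c("u", "v", "u", "v"))

# Default: global intercept plus one contrast per non-reference group.
default_cols <- colnames(model.matrix(~ x1 + x2))
default_cols
# "(Intercept)" "x1b" "x2v"

# Dropping the intercept: every group of the first variable gets its own
# column, while later variables still use a reference group.
no_intercept_cols <- colnames(model.matrix(~ x1 + x2 - 1))
no_intercept_cols
# "x1a" "x1b" "x2v"
```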

SLIDE 16

Dummy variables

Codes group membership
Used under the hood (i.e., model.matrix())
0s and 1s for each group
Example input: colors = c("red", "blue")
Dummy variables for y ~ colors:

intercept = c(1, 1) red = c(1, 0)

Dummy variables for y ~ colors - 1:

blue = c(0, 1) red = c(1, 0)
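The dummy coding for this two-color example can be reproduced directly. A minimal sketch, using the colors vector from the slide; the object names with_intercept and without_intercept are hypothetical.

```r
colors <- c("red", "blue")

# Reference intercept + contrast: "blue" is alphabetically first, so it is
# the reference group and only "red" gets a dummy column.
with_intercept <- model.matrix(~ colors)

# No global intercept: one 0/1 column per group.
without_intercept <- model.matrix(~ colors - 1)
without_intercept
```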

SLIDE 17

model.matrix()

model.matrix() does the legwork for us

Foundation for formulas in R
Order determined by factor order
Change order with Tidyverse or factor()

> model.matrix( ~ colors)
  (Intercept) colorsred
1           1         1
2           1         0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$colors
[1] "contr.treatment"
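Changing the factor order changes which group serves as the reference. The sketch below uses base R's factor() with an explicit levels argument; the object names default_order and red_first are hypothetical.

```r
colors <- c("red", "blue")

# Default factor order is alphabetical: "blue" is the reference group.
default_order <- factor(colors)
levels(default_order)

# Setting the levels explicitly makes "red" the reference group instead,
# so the contrast column is now for "blue".
red_first <- factor(colors, levels = c("red", "blue"))
red_first_cols <- colnames(model.matrix(~ red_first))
red_first_cols
```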

SLIDE 18

Factor vs numeric caveat

R thinks variable is numeric
  e.g., month = c(1, 2, 3)
Need to specify factor or character
  e.g., month = factor(c(1, 2, 3))

> month <- c(1, 2, 3)
> model.matrix( ~ month)
  (Intercept) month
1           1     1
2           1     2
3           1     3
attr(,"assign")
[1] 0 1
> month <- factor(c(1, 2, 3))
> model.matrix( ~ month)
  (Intercept) month2 month3
1           1      0      0
2           1      1      0
3           1      0      1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$month
[1] "contr.treatment"

SLIDE 19

Let's practice!


SLIDE 20

Assumptions of multiple logistic regression


SLIDE 21

Assumptions

Limitations also apply to Poisson and other GLMs
Important assumptions:
  Simpson's paradox
  Linear, monotonic
  Independence
  Overdispersion

SLIDE 22

Example Simpson's paradox

SLIDE 23

Simpson's paradox

Key points
  Missing important predictor
  Inclusion changes outcome
  Easy to visualize with lm()
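The key points above can be seen in a small lm() example. The data below are made up for illustration: within each group y falls as x rises, but the group with larger x values also has larger y values, so leaving the group out flips the sign of the slope.

```r
# Made-up data: within each group y falls as x rises,
# but group B sits higher on both axes.
dat <- data.frame(
  x     = 1:10,
  group = rep(c("A", "B"), each = 5)
)
dat$y <- ifelse(dat$group == "A", 5, 12) - 0.5 * dat$x

pooled   <- lm(y ~ x, data = dat)          # group omitted
adjusted <- lm(y ~ x + group, data = dat)  # group included

coef(pooled)["x"]    # positive slope when the group is ignored
coef(adjusted)["x"]  # negative slope: the within-group trend
```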

SLIDE 24

Simpson's paradox and admission data

Admissions data
  University of California Berkeley
  Graduate admission
  Rate of admission by department and gender
Does bias exist?
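Base R ships a Berkeley admissions table as UCBAdmissions, which is assumed here to correspond to the data on this slide. The sketch fits a logistic regression with and without department; the column admitted and the model names are hypothetical.

```r
# UCBAdmissions is a built-in 3-way table: Admit x Gender x Dept.
ucb <- as.data.frame(UCBAdmissions)
ucb$admitted <- as.integer(ucb$Admit == "Admitted")

# Frequency weights let each row stand for Freq applicants.
marginal <- glm(admitted ~ Gender, weights = Freq,
                data = ucb, family = "binomial")
by_dept  <- glm(admitted ~ Gender + Dept, weights = Freq,
                data = ucb, family = "binomial")

coef(marginal)["GenderFemale"]  # negative when department is ignored
coef(by_dept)["GenderFemale"]   # sign changes once department is included
```

The change in sign once Dept enters the model is the pattern the slide calls Simpson's paradox.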

SLIDE 25

SLIDE 26

Independence

Predictors
  If all independent, order has no effect on estimates
  If non-independent, order can change estimates
Response
  What is unit of focus?
  Individual, groups, group of groups?
  Test scores
  Individual student? Teacher? School? District?

SLIDE 27

Overdispersion

Too many zeros or ones (Binomial)
Too many zeros, too large variance (Poisson)
Variance changes
Beyond scope of this course
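Although the course leaves overdispersion aside, one common rough check for a Poisson fit is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below uses simulated counts; the negative binomial parameters and object names are arbitrary assumptions.

```r
set.seed(7)
# Simulated counts with more variance than a Poisson model expects
# (negative binomial draws; parameters are arbitrary).
counts <- rnbinom(200, size = 1, mu = 5)
fit_pois <- glm(counts ~ 1, family = "poisson")

# Pearson chi-square divided by residual degrees of freedom;
# values well above 1 suggest overdispersion.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion
```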

SLIDE 28

Let's practice!


SLIDE 29

Conclusion


SLIDE 30

What you've learned

How GLM extends LM:
  Poisson error term
  Binomial error term
Understanding and plotting results
GLM with multiple regression

SLIDE 31

Where to from here?

DataCamp Multiple (linear) regression course (if you missed it)
Extending to include random effects with Hierarchical and mixed-effect models
Fit generalized additive models (GAMs) to non-linear models
Decide what coefficients to use with model selection such as AIC
Many other types of regression
Searching and R packages documentation to learn more

SLIDE 32

Happy coding!
