SLIDE 1

Machine Learning for Computational Linguistics

Regression

Çağrı Çöltekin

University of Tübingen
Seminar für Sprachwissenschaft

April 26/28, 2016

SLIDE 2

Practical matters

▶ Course credits:
  9 ECTS with term paper
  6 ECTS without term paper
▶ Homeworks & evaluation: for each homework, you get either
  0: not satisfactory or not submitted
  a grade in [6, 10]: satisfactory and on time
▶ Late homeworks are not accepted

Please follow the instructions precisely!

SLIDE 5

Entropy of your random numbers

[Figure: bar chart of how often each number from 1 to 20 was picked]

H(X) = − ∑x P(x) log2 P(x) = 2.61

If the data were really uniformly distributed: H(X) = log2 20 = 4.32.

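The computation, as a minimal R sketch (the function and the example call are illustrative, not course code):

# Entropy (in bits) of an empirical distribution given as counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
entropy(rep(1, 20))        # uniform over 20 values: log2(20) = 4.32 bits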

SLIDE 6

Coding a four-letter alphabet

letter  prob  code 1  code 2
a       1/2   00      0
b       1/4   01      10
c       1/8   10      110
d       1/8   11      111

Average code length of a string under code 1: 1/2 × 2 + 1/4 × 2 + 1/8 × 2 + 1/8 × 2 = 2.0 bits
Average code length of a string under code 2: 1/2 × 1 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75 bits = H

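The same arithmetic in R, as a sketch (the vectors are transcribed from the table above):

p    <- c(a = 1/2, b = 1/4, c = 1/8, d = 1/8)
len1 <- c(2, 2, 2, 2)     # code 1: 00, 01, 10, 11
len2 <- c(1, 2, 3, 3)     # code 2: 0, 10, 110, 111
sum(p * len1)             # 2.00 bits per symbol
sum(p * len2)             # 1.75 bits per symbol
-sum(p * log2(p))         # entropy H = 1.75 bits: code 2 reaches the bound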

SLIDE 7

Statistical inference and estimation

▶ Statistical inference is about making generalizations that go beyond the data at hand (training set, or experimental sample)
▶ In a typical scenario, we (implicitly) assume that a particular class of models describes the real-world process, and try to find the best model within that class
▶ In most cases, our models are parametrized: the model is defined by a set of parameters
▶ The task, then, becomes estimating the parameters from the training set such that the resulting model is useful for unseen instances

SLIDE 8

Estimation of model parameters

A typical statistical model can be formulated as

y = f(x; w) + ϵ

x is the input to the model
y is the quantity or label assigned to a given input
w is the parameter(s) of the model
f(x; w) is the model's estimate of the output y for input x, sometimes denoted ŷ
ϵ represents the uncertainty or noise that we cannot explain or account for

▶ In machine learning, the focus is on correct prediction of y
▶ In statistics, the focus is on inference (testing hypotheses or explaining the observed phenomena)

SLIDE 9

Estimating parameters: Bayesian approach

Given the training data X, we find the posterior distribution

p(w|X) = p(X|w)p(w) / p(X)

▶ The result, the posterior, is a probability distribution over the parameter(s)
▶ One can get a point estimate of w, for example, by calculating the expected value of the distribution
▶ The posterior distribution also contains information on the uncertainty of the estimate
▶ Prior information can be specified by the prior distribution

SLIDE 10

Estimating parameters: frequentist approach

Given the training data X, we find the value of w that maximizes the likelihood:

ŵ = arg max_w p(X|w)

▶ The likelihood function p(X|w), often denoted L(w|X), is the probability of the data given w for discrete variables, and the value of the probability density function for continuous variables
▶ The problem becomes searching for the maximum value of a function
▶ Note that we cannot make probabilistic statements about w
▶ Uncertainty of the estimate is less straightforward

SLIDE 11

A simple example: estimation of the population mean

We assume that the observed data comes from the model

y = µ + ϵ,  where ϵ ∼ N(0, σ²)

An example: assume we are estimating the average number of characters in Twitter messages. We will use two data sets:

▶ 87, 101, 88, 45, 138
  ▶ The mean of the sample (x̄) is 91.8
  ▶ The variance of the sample (sd²) is 1111.7 (sd = 33.34)
▶ 87, 101, 88, 45, 138, 66, 79, 78, 140, 102
  ▶ x̄ = 92.4
  ▶ sd² = 876.71 (sd = 29.61)

SLIDE 12

Estimating mean: Bayesian way

We simply use Bayes' formula:

p(µ|D) = p(D|µ)p(µ) / p(D)

▶ With a vague prior (high variance/entropy), the posterior mean is (almost) the same as the mean of the data
▶ With a prior with lower variance, the posterior is between the prior and the data mean
▶ The posterior variance indicates the uncertainty of our estimate. With more data, we get a more certain estimate

SLIDE 13

Estimating mean: Bayesian way
vague prior, small sample

[Figure: prior, likelihood, and posterior densities. Prior: N(70, 1000); Likelihood: N(91.8, 33.34); Posterior: N(91.78, 14.91)]

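The posterior parameters on this and the next two slides follow from the standard normal-normal conjugate update. A sketch in R (reading N(m, s) as mean and standard deviation, and plugging the sample sd in for the known data sd, which is an assumption, not something stated on the slides):

# Conjugate posterior for a normal mean with known data sd
posterior <- function(prior.mean, prior.sd, x) {
  prec <- 1 / prior.sd^2 + length(x) / sd(x)^2   # posterior precision
  m <- (prior.mean / prior.sd^2 + sum(x) / sd(x)^2) / prec
  c(mean = m, sd = sqrt(1 / prec))
}
posterior(70, 1000, c(87, 101, 88, 45, 138))     # approx. N(91.78, 14.91)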

SLIDE 14

Estimating mean: Bayesian way
vague prior, larger sample

[Figure: prior, likelihood, and posterior densities. Prior: N(70, 1000); Likelihood: N(92.40, 29.61); Posterior: N(92.39, 9.36)]

SLIDE 15

Estimating mean: Bayesian way
stronger prior, larger sample

[Figure: prior, likelihood, and posterior densities. Prior: N(70, 50); Likelihood: N(92.40, 29.61); Posterior: N(91.64, 9.20)]

SLIDE 16

Estimating mean: frequentist way

▶ The MLE of the population mean is the mean of the sample
  ▶ For the 5-tweet sample: µ̂ = x̄ = 91.8
  ▶ For the 10-tweet sample: µ̂ = x̄ = 92.4
▶ We express the uncertainty in terms of the standard error of the mean (SE),
  SEx̄ = sdx / √n
  which corresponds to the standard deviation of the means of (hypothetical) samples of the same size drawn from the same population
  ▶ For the 5-tweet sample: SEx̄ = 33.34 / √5 = 14.91
  ▶ For the 10-tweet sample: SEx̄ = 29.61 / √10 = 9.36
▶ A rough estimate of a 95% confidence interval is x̄ ± 2SEx̄
  ▶ For the 5-tweet sample: 91.8 ± 2 × 14.91 = [61.98, 121.62]
  ▶ For the 10-tweet sample: 92.4 ± 2 × 9.36 = [83.04, 101.76]
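These numbers can be reproduced in a few lines of R (a sketch; the data are the two samples above):

se.mean <- function(x) sd(x) / sqrt(length(x))
ci95    <- function(x) mean(x) + c(-2, 2) * se.mean(x)
tweets5  <- c(87, 101, 88, 45, 138)
tweets10 <- c(87, 101, 88, 45, 138, 66, 79, 78, 140, 102)
se.mean(tweets5); ci95(tweets5)     # 14.91; [61.98, 121.62]
se.mean(tweets10); ci95(tweets10)   # 9.36;  [83.04, 101.76]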

SLIDE 17

Regression

▶ Regression is a supervised method for predicting the value of a continuous response variable based on a number of predictors
▶ We estimate the conditional expectation of the outcome variable given the predictor(s)
▶ If the outcome is a label, the problem is called classification. But the border between the two is often not that clear


SLIDE 19

The linear equation: a reminder

y = a + bx

a (intercept) is where the line crosses the y axis.
b (slope) is the change in y as x is increased by one unit.

What is the correlation between x and y for each line (relation)?

[Figure: the lines y = 1 − x, y = (1/2)x, y = 2 + (1/2)x, and y = −1 in the x-y plane]

SLIDE 20

The simple linear model

yi = a + bxi + ϵi

y is the outcome (or response, or dependent) variable. The index i represents each unit of observation/measurement (sometimes called a 'case').
x is the predictor (or explanatory, or independent) variable.
a is the intercept.
b is the slope of the regression line. a and b are called coefficients or parameters.
a + bx is the deterministic part of the model. It is the model's prediction of y (ŷ), given x.
ϵ is the residual, error, or the variation that is not accounted for by the model. It is assumed to be normally distributed with 0 mean.


SLIDE 26

Notation differences for the regression equation

yi = w0 + w1xi + ϵi

▶ Sometimes, Greek letters α and β are used for the intercept and the slope, respectively
▶ Another common notation is to use only b, β, or θ, with subscripts: 0 indicating the intercept and 1 indicating the slope
▶ In machine learning it is common to use w for all coefficients (sometimes you may see b used instead of w0)
▶ Sometimes coefficients wear hats (ŵ0, ŵ1), to emphasize that they are estimates
▶ Often, we use vector notation for both input(s) and coefficients, w = (w0, w1) and xi = (1, xi), so the equation becomes yi = wxi + ϵi

SLIDE 27

Visualization of regression procedure

[Figure: x-y scatter plot illustrating the fitting of a regression line]



SLIDE 31

Least-squares regression

Least-squares regression is the method of determining the regression coefficients that minimize the sum of squared residuals (SSR):

yi = w0 + w1xi + ϵi,  where ŷi = w0 + w1xi

▶ We try to find the w0 and w1 that minimize the prediction error:

∑i ϵi² = ∑i (yi − ŷi)² = ∑i (yi − (w0 + w1xi))²

▶ This minimization problem can be solved analytically*, yielding:

w1 = r × sdy / sdx    w0 = ȳ − w1x̄

* See appendix for the derivation.
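A quick numeric check of the closed form against R's lm(), on made-up data (a sketch, not course code):

set.seed(1)
x <- rnorm(50); y <- 2 + 0.5 * x + rnorm(50)
w1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * sd_y / sd_x
w0 <- mean(y) - w1 * mean(x)      # intercept: ybar - w1 * xbar
c(w0, w1)
coef(lm(y ~ x))                   # the same two values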

SLIDE 32

Short digression: minimizing functions

In least-squares regression, we want to find the w0 and w1 values that minimize the quantity

∑i (yi − (w0 + w1xi))²

▶ Note that the above is a quadratic function of w0 and w1
▶ This is important, since such quadratic functions are convex and have a single extreme value: we have a unique solution for our minimization problem
▶ In the case of least-squares regression, we are even luckier: we can find an analytic solution
▶ Even if we do not have an analytic solution, if our error function is convex, a search procedure like gradient descent can find the global minimum

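For illustration, a minimal gradient-descent sketch for this error function (made-up data; the learning rate and iteration count are arbitrary choices, not from the slides):

set.seed(1)
x <- rnorm(50); y <- 2 + 0.5 * x + rnorm(50)
w <- c(0, 0)                        # (w0, w1), arbitrary starting point
eta <- 0.1                          # learning rate
for (step in 1:1000) {
  err  <- y - (w[1] + w[2] * x)     # residuals under the current w
  grad <- -2 * c(mean(err), mean(err * x))
  w <- w - eta * grad               # move against the gradient
}
w                                   # matches the analytic least-squares solution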

SLIDE 33

Explained variation

[Figure: a data point y, its prediction ŷ, and the mean ȳ, marking the total, unexplained, and explained variation]

Total variation = Unexplained variation + Explained variation
y − ȳ = (y − ŷ) + (ŷ − ȳ)

SLIDE 34

Assessing the model fit: r²

We can express the variation explained by a regression model as:

Explained variation / Total variation = ∑i (ŷi − ȳ)² / ∑i (yi − ȳ)²

It can be shown that this value is the square of the correlation coefficient, r², also called the coefficient of determination.

▶ 100 × r² can be interpreted as 'the percentage of variance explained by the model'
▶ r² shows how well the model fits the data: the closer the data points are to the regression line, the higher the value of r²

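The identity is easy to verify numerically; a short R sketch on made-up data:

set.seed(1)
x <- rnorm(50); y <- 2 + 0.5 * x + rnorm(50)
fit  <- lm(y ~ x)
yhat <- fitted(fit)
sum((yhat - mean(y))^2) / sum((y - mean(y))^2)  # explained / total variation
cor(x, y)^2                                     # the same value, r^2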

SLIDE 35

Regression and inference: an example
(1) The data

We want to see the effect of mothers' IQ on four-year-old children's cognitive test scores (fake data, based on the analysis presented in Gelman & Hill 2007).

Case  Kid's score  Mother's IQ
1     109          91
2     99           102
3     96           88
…
43    108          101
44    110          78
45    97           67


SLIDE 38

Regression and inference: an example
(2) Analysis (R output)

lm(formula = kid.score ~ mother.iq)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5174    24.2375   0.145    0.885
mother.iq     0.6023     0.2471   2.437    0.019 *

Residual standard error: 22.59 on 43 degrees of freedom
Multiple R-squared: 0.1214, Adjusted R-squared: 0.101
F-statistic: 5.941 on 1 and 43 DF, p-value: 0.019

w1 = 0.6  The expected score difference between two children whose mothers' IQs differ by one unit.
r2 = 0.12 Mothers' IQ explains 12% of the variation in test scores.
p = 0.02  Given the sample size, the probability of finding a w1 value that far from 0 (two-tailed t-test with null hypothesis w1 = 0).
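A sketch of the calls that produce output like this (assuming a data frame kids with columns kid.score and mother.iq; the name kids is hypothetical):

fit <- lm(kid.score ~ mother.iq, data = kids)
summary(fit)    # coefficient table, R-squared, F statistic
confint(fit)    # confidence intervals for the intercept and slope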

SLIDE 39

Notes/issues on ordinary least-squares regression

▶ The response variable should be linearly related to the predictor(s)
▶ Least-squares estimation is sensitive to outliers
▶ The residuals should be normally distributed

SLIDE 40

You should always check your data

[Figure: scatter plots of the four Anscombe data sets, each with the same fitted regression line]

* This data set is known as Anscombe's quartet (Anscombe, 1973). All four sets have the same mean, variance, and fitted regression line.

SLIDE 41

Regression with multiple predictors

yi = w0 + w1xi,1 + w2xi,2 + … + wkxi,k + ϵi

w0 is the intercept (as before).
w1..k are the coefficients of the respective predictors.
ϵ is the error term (residual).

▶ Using vector notation, the equation becomes yi = wxi + ϵi, where w = (w0, w1, …, wk) and xi = (1, xi,1, …, xi,k)
▶ It is a generalization of simple regression with some additional power and complexity

SLIDE 42

Visualizing regression with two predictors

[Figure: 3-D scatter plot of y against two predictors x1 and x2]

SLIDE 43

Input/output of linear regression: some notation

A regression with k input variables and n instances can be described as

y = Xw + ϵ

where y = (y1, …, yn) is the vector of outcomes, X is the n × (k+1) matrix of inputs whose i-th row is (1, xi,1, …, xi,k), w = (w0, w1, …, wk) is the coefficient vector, and ϵ = (ϵ1, …, ϵn) is the vector of errors.

SLIDE 44

Estimation in multiple regression

y = Xw + ϵ

We want to minimize the error (as a function of w):

ϵ² = J(w) = (y − Xw)² = ∥y − Xw∥²

Our least-squares estimate is:

ŵ = arg min_w J(w) = (XᵀX)⁻¹Xᵀy

Note: the least-squares estimate is also the maximum likelihood estimate under the assumption of normally distributed errors.

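A sketch of the normal-equations solution in R, checked against lm() (made-up data; solve(A, b) is used rather than explicitly inverting XᵀX, a standard numerical precaution):

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))        # design matrix with an intercept column
y <- drop(X %*% c(3, 1.5, -2)) + rnorm(n)  # true w = (3, 1.5, -2) plus noise
drop(solve(t(X) %*% X, t(X) %*% y))      # (X'X)^{-1} X'y
coef(lm(y ~ X[, 2] + X[, 3]))            # the same estimates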

SLIDE 45

Issues in multiple regression estimation

▶ Overfitting: many variables cause the model to learn noise in the data (we will return to this issue)
▶ Collinearity: high correlation between predictors increases the uncertainty of coefficient estimates
▶ Model/feature selection is typically needed for both prediction and inference

SLIDE 46

Categorical predictors

▶ Categorical predictors are represented as multiple binary-coded input variables
▶ For a binary predictor, we use a single binary input (1 for one of the values, and 0 for the other). For example:

x = { 0 for male
    { 1 for female

▶ For a categorical predictor with k values, we use k − 1 predictors (various coding schemes are possible). For example, for 3 values:

x = { (0, 0) for neutral
    { (0, 1) for negative
    { (1, 0) for positive

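In R this expansion happens automatically for factors; a sketch (R's default treatment coding uses the first level as the reference, so the columns match the slide's scheme up to ordering):

sentiment <- factor(c("neutral", "negative", "positive", "neutral"),
                    levels = c("neutral", "negative", "positive"))
model.matrix(~ sentiment)   # intercept plus k - 1 = 2 binary columns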

SLIDE 47

Dealing with non-linearity (to some extent)

▶ Least squares works because the model is linear with respect to the parameters w
▶ Introducing non-linear combinations of the inputs does not affect the estimation procedure. The following are still linear models:

yi = w0 + w1xi² + ϵi
yi = w0 + w1 log(xi) + ϵi
yi = w0 + w1xi,1 + w2xi,2 + w3xi,1xi,2 + ϵi

▶ These transformations allow linear models to deal with some non-linearities
▶ In general, we can replace the input x by a function of the input(s), Φ(x). Φ() is called a basis function

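Fitting such models in R is a matter of transforming the inputs in the formula; a sketch with made-up data:

set.seed(1)
x <- runif(50, 1, 10)
y <- 45 + 12 * x^2 + rnorm(50, sd = 30)
lm(y ~ I(x^2))        # quadratic basis: still linear in the coefficients
lm(y ~ log(x))        # logarithmic basis
lm(y ~ poly(x, 2))    # orthogonal polynomial basis of degree 2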


SLIDE 51

Example: polynomial basis functions

[Figure: scatter plot of y against x with three fitted curves, one per equation below]

y = −221.3 + 109.9x
y = 45.50 − 3.52x + 12.13x²
y = 1445.80 − 3189.13x + 2604.21x² − 1026.76x³ + 218.40x⁴ − 25.52x⁵ + 1.54x⁶

SLIDE 52

Next...

Tuesday: hands-on exercises with regression
Next week: classification

SLIDE 53 (Appendix A.1)

Estimating the regression line

We express the sum of squared residuals as a function of the (unknown) regression line (all sums run over i = 1, …, n):

∑i ϵi² = ∑i (yi − ŷi)²
       = ∑i (yi − (a + bxi))²
       = ∑i (yi − a − bxi)²
       = ∑i (a² + 2abxi − 2ayi + b²xi² − 2bxiyi + yi²)

Thus, ∑i ϵi² is a function f of x, y with unknown parameters a, b.

SLIDE 54 (Appendix A.2)

Estimating the regression line

For a fixed sample S = (x, y), we want to minimize f_ab(x, y) with

f_ab(x, y) = ∑i (a² + 2abxi − 2ayi + b²xi² − 2bxiyi + yi²)

To minimize this function, find a and b such that the derivatives are 0: treat a and b as variables and take the partial derivatives ∂f/∂a and ∂f/∂b:

∂f/∂a = ∑i (2a + 2bxi − 2yi)
∂f/∂b = ∑i (2axi + 2bxi² − 2xiyi)

SLIDE 55 (Appendix A.3)

Relationship between correlation and regression

Recall the two partial derivatives obtained when minimizing the sum of squared residuals:

∂f/∂a = ∑i (2a + 2bxi − 2yi)      (1)
∂f/∂b = ∑i (2axi + 2bxi² − 2xiyi)  (2)

Set (1) to zero:

∂f/∂a = 0
⇔ n · 2a + ∑i (2bxi − 2yi) = 0
⇔ n · 2a + 2b ∑i xi − 2 ∑i yi = 0
⇔ n · a = n · ȳ − n · b · x̄
⇔ a = ȳ − b x̄

SLIDE 56 (Appendix A.4)

Relationship between correlation and regression

Plug a = ȳ − b x̄ into (2) and set it to zero:

∑i (2(ȳ − b x̄)xi + 2bxi² − 2xiyi) = 0
⇔ (ȳ − b x̄)(n x̄) + b ∑i xi² − ∑i xiyi = 0
⇔ n x̄ ȳ − b x̄² n + b ∑i xi² − ∑i xiyi = 0
⇔ b (∑i xi² − n x̄²) = ∑i xiyi − n x̄ ȳ
⇔ b = (∑i xiyi − n x̄ ȳ) / (∑i xi² − n x̄²)

SLIDE 57 (Appendix A.5)

Relationship between correlation and regression

b = (∑i xiyi − n x̄ ȳ) / (∑i xi² − n x̄²)
⇔ b = (∑i xiyi − n x̄ ȳ) / ∑i (xi − x̄)²
⇔ b = ∑i (xi − x̄)(yi − ȳ) / ∑i (xi − x̄)²
⇔ b = (1/(n−1)) ∑i (xi − x̄)(yi − ȳ) / ((1/(n−1)) ∑i (xi − x̄)²)
⇔ b = (1/(n−1)) ∑i (xi − x̄)(yi − ȳ) / σx²
⇔ b = ((1/(n−1)) ∑i ((xi − x̄)/σx) ((yi − ȳ)/σy)) · σy/σx
⇔ b = r σy/σx

SLIDE 58 (Appendix A.6)

Another relation between correlation and regression

explained variance / total variance
= ∑i ((a + bxi) − ȳ)² / ∑i (yi − ȳ)²
= ∑i ((ȳ − b x̄ + bxi) − ȳ)² / ∑i (yi − ȳ)²
= b² ∑i (xi − x̄)² / ∑i (yi − ȳ)²
= b² · (σx/σy)²
= r² (σy/σx)² (σx/σy)²
= r²

SLIDE 59 (Appendix A.7)

Standard error for the regression slope and intercept

SEb = sdr / √(∑i (xi − x̄)²)
SEa = sdr × √(1/n + x̄² / ∑i (xi − x̄)²)

where sdr is the standard deviation of the residuals (the residual standard error).

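These formulas can be checked against R's lm() output; a sketch on made-up data:

set.seed(1)
x <- rnorm(50); y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
sdr <- sqrt(sum(resid(fit)^2) / df.residual(fit))   # residual standard error
SEb <- sdr / sqrt(sum((x - mean(x))^2))
SEa <- sdr * sqrt(1 / length(x) + mean(x)^2 / sum((x - mean(x))^2))
c(SEa, SEb)
summary(fit)$coefficients[, "Std. Error"]           # the same values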