
Midterm review

CS 446

  • 1. Lecture review

(Lec1.) Basic setting: supervised learning

Training data: labeled examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where
◮ each input $x_i$ is a machine-readable description of an instance (e.g., image, sentence), and
◮ each corresponding label $y_i$ is an annotation relevant to the task (typically not easy to obtain automatically).

Goal: learn a function $\hat{f}$ from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs. (Note: 0 training error is easy; test/population error is what matters.)

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]

(Lec2.) k-nearest neighbors classifier

Given: labeled examples $D := \{(x_i, y_i)\}_{i=1}^n$.

Predictor $\hat{f}_{D,k} : \mathcal{X} \to \mathcal{Y}$: on input $x$,
  • 1. Find the $k$ points $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ among $\{x_i\}_{i=1}^n$ "closest" to $x$ (the $k$ nearest neighbors).
  • 2. Return the plurality label of $y_{i_1}, y_{i_2}, \ldots, y_{i_k}$.

(Break ties in both steps arbitrarily.)
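A minimal NumPy sketch of this predictor (the helper name knn_predict is mine, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by plurality vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest (ties arbitrary)
    return Counter(y_train[nearest]).most_common(1)[0][0]
```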

(Lec2.) Choosing k

The hold-out set approach:
  • 1. Pick a subset V ⊂ S (hold-out set, a.k.a. validation set).
  • 2. For each k ∈ {1, 3, 5, . . . }:
       ◮ Construct the k-NN classifier $\hat{f}_{S \setminus V, k}$ using S \ V.
       ◮ Compute the error rate of $\hat{f}_{S \setminus V, k}$ on V (the "hold-out error rate").
  • 3. Pick the k that gives the smallest hold-out error rate.

(There are many other approaches.)
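A sketch of the hold-out procedure, reusing the knn_predict helper above (an illustration, not lecture code):

```python
import numpy as np

def choose_k(X, y, ks=(1, 3, 5, 7, 9), val_frac=0.2, seed=0):
    """Pick k by minimizing the error rate on a random hold-out set V."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, train = idx[:n_val], idx[n_val:]        # V and S \ V

    def holdout_err(k):
        preds = [knn_predict(X[train], y[train], x, k) for x in X[val]]
        return np.mean(np.array(preds) != y[val])

    return min(ks, key=holdout_err)
```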

(Lec2.) Decision trees

Directly optimize tree structure for good classification. A decision tree is a function f : X → Y, represented by a binary tree in which:
◮ each tree node is associated with a splitting rule g : X → {0, 1}, and
◮ each leaf node is associated with a label ŷ ∈ Y.

When $X = \mathbb{R}^d$, typically only consider splitting rules of the form $g(x) = \mathbf{1}\{x_i > t\}$ for some $i \in [d]$ and $t \in \mathbb{R}$; these are called axis-aligned or coordinate splits. (Notation: [d] := {1, 2, . . . , d}.)

[Tree diagram: root splits on x1 > 1.7; one branch is the leaf ŷ = 1, the other splits on x2 > 2.8 into leaves ŷ = 2 and ŷ = 3.]

(Lec2.) Decision tree example

[Scatter plot: petal length/width vs. sepal length/width for the iris data, built up over several slides with the splits x1 > 1.7 and x2 > 2.8 overlaid.]

Classifying irises by sepal and petal measurements:
◮ X = R², Y = {1, 2, 3}
◮ x1 = ratio of sepal length to width
◮ x2 = ratio of petal length to width

The tree is grown incrementally: first the split x1 > 1.7 (leaves ŷ = 1 and ŷ = 3), then the split x2 > 2.8 is added, giving leaves ŷ = 1, ŷ = 2, ŷ = 3.

(Lec2.) Nearest neighbors and decision trees

Today we covered two standard machine learning methods.

[Figure: decision boundaries of both methods on a 2-D dataset with axes x1, x2.]

Nearest neighbors. Training/fitting: memorize data. Testing/predicting: find the k closest memorized points, return the plurality label. Overfitting? Vary k.

Decision trees. Training/fitting: greedily partition space, reducing "uncertainty". Testing/predicting: traverse the tree, output the leaf label. Overfitting? Limit or prune the tree.

Note: both methods can also output real numbers (regression, not classification); return the median/mean of the neighbors (resp. the points reaching the leaf).

(Lec3-4.) ERM setup for least squares

◮ Predictors/model: $\hat{f}(x) = w^\top x$, a linear predictor/regressor. (For linear classification: $x \mapsto \mathrm{sgn}(w^\top x)$.)
◮ Loss/penalty: the least squares loss $\ell_{\mathrm{ls}}(y, \hat{y}) = (y - \hat{y})^2$. (Some conventions scale this by 1/2.)
◮ Goal: minimize the least squares empirical risk
$$\hat{R}_{\mathrm{ls}}(\hat{f}) = \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ls}}(y_i, \hat{f}(x_i)) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2.$$
◮ Specifically, we choose $w \in \mathbb{R}^d$ according to
$$\arg\min_{w \in \mathbb{R}^d} \hat{R}_{\mathrm{ls}}\left(x \mapsto w^\top x\right) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \left(y_i - w^\top x_i\right)^2.$$
◮ More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.

(Lec3-4.) ERM in general

◮ Pick a family of models/predictors F. (For today, linear predictors.)
◮ Pick a loss function ℓ. (For today, squared loss.)
◮ Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, and an optimizer, and tell it to minimize.
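A minimal PyTorch sketch of that recipe for least squares (toy random data, purely illustrative):

```python
import torch

X = torch.randn(100, 5)                     # n = 100 examples, d = 5
y = torch.randn(100, 1)

model = torch.nn.Linear(5, 1, bias=False)   # model: x -> w^T x
loss_fn = torch.nn.MSELoss()                # loss: squared error
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):                        # minimize empirical risk
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
```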

(Lec3-4.) Empirical risk minimization in matrix notation

Define the n × d matrix A and n × 1 column vector b by
$$A := \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^\top \rightarrow \\ \vdots \\ \leftarrow x_n^\top \rightarrow \end{pmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Can write the empirical risk as
$$\hat{R}(w) = \frac{1}{n}\sum_{i=1}^n \left(y_i - x_i^\top w\right)^2 = \|Aw - b\|_2^2.$$

Necessary condition for w to be a minimizer of $\hat{R}$: $\nabla \hat{R}(w) = 0$, i.e., w is a critical point of $\hat{R}$. This translates to
$$(A^\top A)\, w = A^\top b,$$
a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of $\hat{R}$ is a minimizer of $\hat{R}$.
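A quick NumPy check (toy data following the scaling above): solving via the pseudoinverse satisfies the normal equations and agrees with np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = rng.standard_normal(n) / np.sqrt(n)

w_pinv = np.linalg.pinv(A) @ b                    # w = A^+ b
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # least squares solver
assert np.allclose(w_pinv, w_lstsq)

# Both satisfy the normal equations (A^T A) w = A^T b.
assert np.allclose(A.T @ A @ w_pinv, A.T @ b)
```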

(Lec3-4.) Full (factorization) SVD (new slide)

Given $M \in \mathbb{R}^{n \times d}$, let $M = USV^\top$ denote the singular value decomposition (SVD), where
◮ $U \in \mathbb{R}^{n \times n}$ is orthonormal, thus $U^\top U = UU^\top = I$,
◮ $V \in \mathbb{R}^{d \times d}$ is orthonormal, thus $V^\top V = VV^\top = I$,
◮ $S \in \mathbb{R}^{n \times d}$ has singular values $s_1 \ge s_2 \ge \cdots \ge s_{\min\{n,d\}}$ along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of $M$.

Some facts:
◮ The SVD is not unique when the singular values are not distinct; e.g., we can write $I = UIU^\top$ where $U$ is any orthonormal matrix.
◮ The pseudoinverse $S^+ \in \mathbb{R}^{d \times n}$ of $S$ is obtained by starting with $S^\top$ and taking the reciprocal of each positive entry.
◮ The pseudoinverse of $M$ is $VS^+U^\top$.
◮ If $M^{-1}$ exists, then $M^{-1} = M^+$.

(Lec3-4.) Thin (decomposition) SVD (new slide)

Given $M \in \mathbb{R}^{n \times d}$, $(s, u, v)$ is a singular value with corresponding left and right singular vectors if $Mv = su$ and $M^\top u = sv$. The thin SVD of $M$ is $M = \sum_{i=1}^r s_i u_i v_i^\top$, where $r$ is the rank of $M$, and
◮ the left singular vectors $(u_1, \ldots, u_r)$ are orthonormal (but we might have $r < \min\{n, d\}$!) and span the column space of $M$,
◮ the right singular vectors $(v_1, \ldots, v_r)$ are orthonormal (but we might have $r < \min\{n, d\}$!) and span the row space of $M$,
◮ the singular values satisfy $s_1 \ge \cdots \ge s_r > 0$.

Some facts:
◮ Pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$.
◮ $(u_i)_{i=1}^r$ span the column space of $M$.
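A NumPy check of these identities on a toy matrix (full_matrices=False gives the thin SVD):

```python
import numpy as np

M = np.random.default_rng(0).standard_normal((6, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD

# M equals the sum of rank-one terms s_i u_i v_i^T.
assert np.allclose(M, (U * s) @ Vt)

# Pseudoinverse M^+ = sum_i (1/s_i) v_i u_i^T.
M_pinv = (Vt.T / s) @ U.T
assert np.allclose(M_pinv, np.linalg.pinv(M))
```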

(Lec3-4.) SVD and least squares

Recall: we'd like to find w such that $A^\top A w = A^\top b$. If $w = A^+ b$, then
$$A^\top A w = \left(\sum_{i=1}^r s_i v_i u_i^\top\right)\left(\sum_{i=1}^r s_i u_i v_i^\top\right)\left(\sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top\right) b = \left(\sum_{i=1}^r s_i v_i u_i^\top\right)\left(\sum_{i=1}^r u_i u_i^\top\right) b = A^\top b.$$

Henceforth, define $\hat{w}_{\mathrm{ols}} = A^+ b$ as the OLS solution. (OLS = "ordinary least squares".)

Note: in general, $AA^+ = \sum_{i=1}^r u_i u_i^\top \ne I$.

(Lec3-4.) Normal equations imply optimality

Consider w with $A^\top A w = A^\top y$, and any w′; then
$$\|Aw' - y\|^2 = \|Aw' - Aw + Aw - y\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^\top (Aw - y) + \|Aw - y\|^2.$$
Since
$$(Aw' - Aw)^\top (Aw - y) = (w' - w)^\top (A^\top A w - A^\top y) = 0,$$
we get $\|Aw' - y\|^2 = \|Aw' - Aw\|^2 + \|Aw - y\|^2 \ge \|Aw - y\|^2$. This means w is optimal.

Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$$\|Aw' - Aw\|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \left(\sum_{i=1}^r s_i^2\, v_i v_i^\top\right)(w' - w),$$
so w′ is optimal iff w′ − w is in the right nullspace of A.

(We'll revisit all this with convexity later.)

(Lec3-4.) Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find the minimizer of
$$\hat{R}(w) + \lambda \|w\|_2^2$$
over $w \in \mathbb{R}^d$.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

◮ This is called ridge regression. (λ = 0 is ERM / ordinary least squares.) Explicit solution: $(A^\top A + \lambda I)^{-1} A^\top b$.
◮ The parameter λ controls how much attention is paid to the regularizer $\|w\|_2^2$ relative to the data-fitting term $\hat{R}(w)$.
◮ Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ1.
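The closed-form ridge solution in NumPy (toy data; note it exists even when n < d):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # underdetermined: n < d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

lam = 0.1
# (A^T A + lam I)^{-1} A^T b; unique for any lam > 0.
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
```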

(Lec5-6.) Geometry of linear classifiers

[Figure: hyperplane H in R² with normal vector w.]

A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension d − 1.
◮ A hyperplane in R² is a line.
◮ A hyperplane in R³ is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be specified by a (non-zero) normal vector $w \in \mathbb{R}^d$. The hyperplane with normal vector w is the set of points orthogonal to w:
$$H = \left\{x \in \mathbb{R}^d : x^\top w = 0\right\}.$$

Given w and its corresponding H: H splits the points labeled positive $\{x : w^\top x > 0\}$ from those labeled negative $\{x : w^\top x < 0\}$.

(Lec5-6.) Classification with a hyperplane

[Figure: point x at angle θ to w, projected onto span{w}.]

The projection of x onto span{w} (a line) has coordinate $\|x\|_2 \cos\theta$, where
$$\cos\theta = \frac{x^\top w}{\|w\|_2 \|x\|_2}.$$
(The distance to the hyperplane is $\|x\|_2\, |\cos\theta|$.)

The decision boundary is the hyperplane (oriented by w):
$$x^\top w > 0 \iff \|x\|_2 \cos\theta > 0 \iff x \text{ on the same side of } H \text{ as } w.$$

What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?

(Lec5-6.) Linear separability

Is it always possible to find w with sign(wᵀxᵢ) = yᵢ? Is it always possible to find a hyperplane separating the data? (Appending 1 means it need not go through the origin.)

[Figure: two 2-D datasets, one linearly separable, one not linearly separable.]

(Lec5-6.) Cauchy-Schwarz (new slide)

Cauchy-Schwarz inequality. $|a^\top b| \le \|a\| \cdot \|b\|$.

Proof. If $\|a\| = \|b\|$, then
$$0 \le \|a - b\|^2 = \|a\|^2 - 2a^\top b + \|b\|^2 = 2\|a\| \cdot \|b\| - 2a^\top b,$$
which rearranges to give $a^\top b \le \|a\| \cdot \|b\|$. If $\|a\| \ne \|b\|$ (both nonzero), apply the preceding to the equal-norm pair $a\sqrt{\|b\|/\|a\|}$ and $b\sqrt{\|a\|/\|b\|}$, whose inner product is again $a^\top b$. For the absolute value, apply the preceding to $(a, -b)$.
(Lec5-6.) Logistic loss

Let's state our classification goal with a generic margin loss ℓ:
$$\hat{R}(w) = \frac{1}{n}\sum_{i=1}^n \ell\left(y_i w^\top x_i\right);$$
the key properties we want:
◮ ℓ is continuous;
◮ $\ell(z) \ge c\,\mathbf{1}[z \le 0] = c\,\ell_{\mathrm{zo}}(z)$ for some c > 0 and any z ∈ R, which implies $\hat{R}_\ell(w) \ge c\,\hat{R}_{\mathrm{zo}}(w)$;
◮ ℓ′(0) < 0 (pushes stuff from wrong to right).

Examples.
◮ Squared loss, written in margin form: $\ell_{\mathrm{ls}}(z) := (1 - z)^2$; note $\ell_{\mathrm{ls}}(y\hat{y}) = (1 - y\hat{y})^2 = y^2(1 - y\hat{y})^2 = (y - \hat{y})^2$ (using y ∈ {−1, +1}, so y² = 1).
◮ Logistic loss: $\ell_{\log}(z) = \ln(1 + \exp(-z))$.

(Lec5-6.) Squared and logistic losses on linearly separable data (I, II)

[Figures (two slides): contour plots of the empirical logistic loss and squared loss in weight space, on linearly separable data.]

(Lec5-6.) Logistic risk and separation

If there exists a perfect linear separator, empirical logistic risk minimization should find it.

Theorem. If there exists $\bar{w}$ with $y_i \bar{w}^\top x_i > 0$ for all i, then every w with
$$\hat{R}_{\log}(w) < \frac{\ln 2}{2n} + \inf_v \hat{R}_{\log}(v)$$
also satisfies $y_i w^\top x_i > 0$.

Proof. Omitted.

(Lec5-6.) Least squares and logistic ERM

Least squares:
◮ Take the gradient of $\|Aw - b\|^2$, set to 0; obtain the normal equations $A^\top A w = A^\top b$.
◮ One choice is the minimum norm solution $A^+ b$.

Logistic loss:
◮ Take the gradient of $\hat{R}_{\log}(w) = \frac{1}{n}\sum_{i=1}^n \ln\left(1 + \exp(-y_i w^\top x_i)\right)$ and set to 0 ???

Remark. Is $A^+ b$ a "closed form expression"?

(Lec5-6.) Gradient descent

Given a function $F : \mathbb{R}^d \to \mathbb{R}$, gradient descent is the iteration
$$w_{i+1} := w_i - \eta_i \nabla_w F(w_i),$$
where $w_0$ is given, and $\eta_i$ is a learning rate / step size.

[Figure: gradient descent path on the contour plot of a quadratic.]

Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.
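A minimal sketch of this iteration for the least squares objective ‖Aw − b‖² (constant step size; it must be small enough relative to A's largest singular value to converge):

```python
import numpy as np

def gradient_descent(A, b, eta=0.01, steps=500):
    """Minimize ||Aw - b||^2 by gradient descent with a constant step size."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = 2 * A.T @ (A @ w - b)   # gradient of the squared error
        w -= eta * grad
    return w
```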

(Lec5-6.) Multiclass?

All our methods so far handle multiclass:
◮ k-NN and decision trees: plurality label.
◮ Least squares: $\arg\min_{W \in \mathbb{R}^{d \times k}} \|AW - B\|_F^2$ with $B \in \mathbb{R}^{n \times k}$; $W \in \mathbb{R}^{d \times k}$ is k separate linear regressors in $\mathbb{R}^d$.

How about linear classifiers?
◮ At prediction time, $x \mapsto \arg\max_y \hat{f}(x)_y$.
◮ As in the binary case: interpretation $f(x)_y = \Pr[Y = y \mid X = x]$.

What is a good loss function?

(Lec5-6.) Cross-entropy

Given two probability vectors $p, q \in \Delta_k = \{p \in \mathbb{R}^k_{\ge 0} : \sum_i p_i = 1\}$,
$$H(p, q) = -\sum_{i=1}^k p_i \ln q_i \qquad \text{(cross-entropy)}.$$
◮ If p = q, then H(p, q) = H(p) (entropy); indeed
$$H(p, q) = -\sum_{i=1}^k p_i \ln\left(p_i \cdot \frac{q_i}{p_i}\right) = \underbrace{H(p)}_{\text{entropy}} + \underbrace{\mathrm{KL}(p, q)}_{\text{KL divergence}}.$$
Since KL ≥ 0, and moreover is 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.
◮ Choose the encoding $\tilde{y} = e_y$ for y ∈ {1, . . . , k}, and $\hat{y} \propto \exp(f(x))$ with $f : \mathbb{R}^d \to \mathbb{R}^k$; then
$$\ell_{\mathrm{ce}}(\tilde{y}, f(x)) = H(\tilde{y}, \hat{y}) = -\sum_{i=1}^k \tilde{y}_i \ln\frac{\exp(f(x)_i)}{\sum_{j=1}^k \exp(f(x)_j)} = -\ln\frac{\exp(f(x)_y)}{\sum_{j=1}^k \exp(f(x)_j)} = -f(x)_y + \ln\sum_{j=1}^k \exp(f(x)_j).$$
(In pytorch, use torch.nn.CrossEntropyLoss()(f(x), y).)
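A quick check that torch.nn.CrossEntropyLoss matches the last expression above (one example, k = 3):

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5]])  # f(x) for one example
y = torch.tensor([0])                      # the true class index

builtin = torch.nn.CrossEntropyLoss()(logits, y)
manual = -logits[0, y.item()] + torch.logsumexp(logits[0], dim=0)
assert torch.allclose(builtin, manual)
```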

(Lec5-6.) Cross-entropy, classification, and margins

The zero-one loss for classification is
$$\ell_{\mathrm{zo}}(y, f(x)) = \mathbf{1}\left[y \ne \arg\max_j f(x)_j\right].$$
In the multiclass case, can define the margin as
$$f(x)_y - \max_{j \ne y} f(x)_j,$$
interpreted as "the distance by which f is correct". (Can be negative!)

Since $\ln\sum_j \exp(z_j) \approx \max_j z_j$, cross-entropy satisfies
$$\ell_{\mathrm{ce}}(\tilde{y}, f(x)) = -f(x)_y + \ln\sum_j \exp(f(x)_j) \approx -f(x)_y + \max_j f(x)_j,$$
thus minimizing cross-entropy maximizes margins.

(Lec7-8.) The ERM perspective

These lectures will follow an ERM perspective on deep networks:
◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
◮ Pick a loss/risk. (We will almost always use cross-entropy!)
◮ Pick an optimizer. (We will mostly treat this as a black box!)

The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

(Lec7-8.) Basic deep networks

A self-contained expression is
$$x \mapsto \sigma_L\left(W_L\, \sigma_{L-1}\left(\cdots W_2\, \sigma_1(W_1 x + b_1) + b_2 \cdots\right) + b_L\right),$$
with equivalent "functional form" $x \mapsto (f_L \circ \cdots \circ f_1)(x)$, where $f_i(z) = \sigma_i(W_i z + b_i)$.

Some further details (many more to come!):
◮ $(W_i)_{i=1}^L$ with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ are the weights, and $(b_i)_{i=1}^L$ are the biases.
◮ $(\sigma_i)_{i=1}^L$ with $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ are called nonlinearities, or activations, or transfer functions, or link functions.
◮ This is only the basic setup; many things can and will change, please ask many questions!
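This basic form in PyTorch (a sketch; the layer widths here are arbitrary):

```python
import torch

# x -> W3 σ2(W2 σ1(W1 x + b1) + b2) + b3, with ReLU activations
# and an identity last layer (typical when paired with cross-entropy).
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(),   # f1: σ1(W1 x + b1)
    torch.nn.Linear(32, 32), torch.nn.ReLU(),   # f2
    torch.nn.Linear(32, 3),                     # f3: identity activation
)
logits = model(torch.randn(4, 10))              # batch of 4 inputs
```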

(Lec7-8.) Choices of activation

Basic form: $x \mapsto \sigma_L(W_L\, \sigma_{L-1}(\cdots W_2\, \sigma_1(W_1 x + b_1) + b_2 \cdots) + b_L)$.

Choices of activation (univariate, coordinate-wise):
◮ Indicator/step/Heaviside/threshold $z \mapsto \mathbf{1}[z \ge 0]$. This was the original choice (1940s!).
◮ Sigmoid $\sigma_s(z) := \frac{1}{1 + \exp(-z)}$. This was popular roughly 1970s-2005?
◮ Hyperbolic tangent $z \mapsto \tanh(z)$. Similar to sigmoid, used during the same interval.
◮ Rectified Linear Unit (ReLU) $\sigma_r(z) = \max\{0, z\}$. It (and slight variants, e.g., Leaky ReLU, ELU, . . . ) is the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
◮ Identity $z \mapsto z$; we'll often use this as the last layer when we use cross-entropy loss.
◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.

(Lec7-8.) "Architectures" and "models"

Basic form: $x \mapsto \sigma_L(W_L\, \sigma_{L-1}(\cdots W_2\, \sigma_1(W_1 x + b_1) + b_2 \cdots) + b_L)$.

$((W_i, b_i))_{i=1}^L$, the weights and biases, are the parameters. Let's roll them into $\mathcal{W} := ((W_i, b_i))_{i=1}^L$ and consider the network as a two-parameter function $F_{\mathcal{W}}(x) = F(x; \mathcal{W})$.
◮ The model or class of functions is $\{F_{\mathcal{W}} : \text{all possible } \mathcal{W}\}$. F (both arguments unset) is also called an architecture.
◮ When we fit/train/optimize, typically we leave the architecture fixed and vary $\mathcal{W}$ to minimize risk. (More on this in a moment.)

(Lec7-8.) ERM recipe for basic networks

Standard ERM recipe:
◮ First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
$$\arg\min_{\mathcal{W}} \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ce}}\left(y_i, F(x_i; \mathcal{W})\right) = \arg\min_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\ b_1 \in \mathbb{R}^{d_1} \\ \cdots \\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\ b_L \in \mathbb{R}^{d_L}}} \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ce}}\left(y_i, \sigma_L(\cdots \sigma_1(W_1 x_i + b_1) \cdots)\right).$$
◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.

(Lec7-8.) Sometimes, linear just isn't enough

[Figure: decision surfaces of a linear predictor vs. a ReLU network on a 2-D dataset.]

Linear predictor: $x \mapsto w^\top \begin{pmatrix} x \\ 1 \end{pmatrix}$. Some blue points misclassified.

ReLU network: $x \mapsto W_2\, \sigma_r(W_1 x + b_1) + b_2$. 0 misclassifications!

(Lec7-8.) Classical example: XOR

Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)

Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.

Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
◮ If the hyperplane splits the blue points, it's incorrect on one of them.
◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.

(Lec7-8.) One layer was not enough. How about two?

Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, . . . ). Given any continuous function $f : \mathbb{R}^d \to \mathbb{R}$ and any ε > 0, there exist parameters $(W_1, b_1, W_2)$ so that
$$\sup_{x \in [0,1]^d} \left|f(x) - W_2\, \sigma(W_1 x + b_1)\right| \le \varepsilon,$$
as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).

Remarks.
◮ Together with the XOR example, this justifies using nonlinearities.
◮ It does not justify (very) deep networks.
◮ It only says these networks exist, not that we can optimize for them!

(Lec7-8.) General graph-based view

Classical graph-based perspective.
◮ The network is a directed acyclic graph; sources are inputs, sinks are outputs, intermediate nodes compute $z \mapsto \sigma(w^\top z + b)$ (with their own (σ, w, b)).
◮ Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.

"Modern" graph-based perspective.
◮ Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.
◮ Edges will often "skip" layers; "layer" is therefore ambiguous.
◮ Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters.

(Lec7-8.) 2-D convolution in deep networks (pictures)

[Animated convolution figures omitted. Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]

(Lec7-8.) Softmax

Replace the vector input z with z′ ∝ eᶻ, meaning
$$z \mapsto \left(\frac{e^{z_1}}{\sum_j e^{z_j}}, \ldots, \frac{e^{z_k}}{\sum_j e^{z_j}}\right).$$
◮ Converts the input into a probability vector; useful for interpreting the network output as $\Pr[Y = y \mid X = x]$.
◮ We have baked it into our cross-entropy definition; last lectures' networks with cross-entropy training had an implicit softmax.
◮ If some coordinate j of z dominates the others, then the softmax is close to $e_j$.

(Lec7-8.) Max pooling

[Figure: 2-D max pooling, built up over several slides; each output entry is the maximum over a sliding window of the input grid. Taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.]

◮ Often used together with convolution layers; shrinks/downsamples the input.
◮ Another variant is average pooling.
◮ Implementation: torch.nn.MaxPool2d.
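Usage sketch (shapes follow torch's (batch, channels, height, width) convention):

```python
import torch

pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 3, 8, 8)   # (batch, channels, height, width)
out = pool(x)                 # each output entry is the max over a 2x2 window
print(out.shape)              # torch.Size([1, 3, 4, 4])
```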

(Lec9-10.) Multivariate network single-example gradients

Define $G_j(\mathcal{W}) = \sigma_j(W_j \cdots \sigma_1(W_1 x + b_1) \cdots)$. The multivariate chain rule tells us
$$\nabla_W F(Wx) = J^\top x^\top,$$
where $J \in \mathbb{R}^{l \times k}$ is the Jacobian matrix of $F : \mathbb{R}^k \to \mathbb{R}^l$ at $Wx$, the matrix of all coordinate-wise derivatives. Applied layer by layer:
$$\frac{dG_L}{dW_L} = J_L^\top\, G_{L-1}(\mathcal{W})^\top, \qquad \frac{dG_L}{db_L} = J_L^\top,$$
$$\vdots$$
$$\frac{dG_L}{dW_j} = \left(J_L W_L J_{L-1} W_{L-1} \cdots J_j\right)^\top G_{j-1}(\mathcal{W})^\top, \qquad \frac{dG_L}{db_j} = \left(J_L W_L J_{L-1} W_{L-1} \cdots J_j\right)^\top,$$
with $J_j$ the Jacobian of $\sigma_j$ at $W_j G_{j-1}(\mathcal{W}) + b_j$. For example, with $\sigma_j$ that is a coordinate-wise $\sigma : \mathbb{R} \to \mathbb{R}$, $J_j$ is
$$\mathrm{diag}\left(\sigma'\left(\left(W_j G_{j-1}(\mathcal{W}) + b_j\right)_1\right), \ldots, \sigma'\left(\left(W_j G_{j-1}(\mathcal{W}) + b_j\right)_{d_j}\right)\right).$$

(Lec9-10.) Initialization

Recall
$$\frac{dG_L}{dW_j} = \left(J_L W_L J_{L-1} W_{L-1} \cdots J_j\right)^\top G_{j-1}(\mathcal{W})^\top.$$
◮ What if we set $\mathcal{W} = 0$? What if σ = σᵣ is a ReLU?
◮ What if we set two rows of $W_j$ (two nodes) identically?
◮ Resolving this issue is called symmetry breaking.
◮ Standard linear/dense layer initializations: $N(0, \frac{2}{d_{\mathrm{in}}})$ ("He et al."), $N(0, \frac{2}{d_{\mathrm{in}} + d_{\mathrm{out}}})$ (Glorot/Xavier), $U(-\frac{1}{\sqrt{d_{\mathrm{in}}}}, \frac{1}{\sqrt{d_{\mathrm{in}}}})$ (torch default). (Convolution layers are adjusted to have similar distributions.)

Random initialization is emerging as a key story in deep networks!

(Lec9-10; SGD slide.) Minibatches

We used the linearity of gradients to write
$$\nabla_w \hat{R}(w) = \frac{1}{n}\sum_{i=1}^n \nabla_w \ell(F(x_i; w), y_i).$$
What happens if we replace $((x_i, y_i))_{i=1}^n$ with a minibatch $((x'_i, y'_i))_{i=1}^b$?
◮ Random minibatch ⟹ the two gradients are equal in expectation.
◮ Most torch layers take minibatch input:
    ◮ torch.nn.Linear has input shape (b, d), output (b, d′).
    ◮ torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).
◮ This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPUs and CPUs).
◮ Setting the batch size is black magic and depends on many things (prediction problem, GPU characteristics, . . . ).
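One pass of minibatch SGD as a sketch (model, loss_fn, and opt are assumed to be objects like those in the earlier ERM sketch):

```python
import torch

def sgd_epoch(model, loss_fn, opt, X, y, batch_size=32):
    """One pass over the data in shuffled minibatches."""
    perm = torch.randperm(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()  # minibatch gradient
        opt.step()
```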

(Lec9-10.) Convex sets

A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ($\{x, x'\} \subseteq S \implies [x, x'] \subseteq S$.)

[Figure: four example sets, labeled convex / not convex / convex / convex.]

Examples:
◮ All of $\mathbb{R}^d$.
◮ The empty set.
◮ Half-spaces: $\{x \in \mathbb{R}^d : a^\top x \le b\}$.
◮ Intersections of convex sets.
◮ Polyhedra: $\{x \in \mathbb{R}^d : Ax \le b\} = \bigcap_{i=1}^m \{x \in \mathbb{R}^d : a_i^\top x \le b_i\}$.
◮ Convex hulls: $\mathrm{conv}(S) := \{\sum_{i=1}^k \alpha_i x_i : k \in \mathbb{N},\, x_i \in S,\, \alpha_i \ge 0,\, \sum_{i=1}^k \alpha_i = 1\}$. (Infinite convex hulls: intersection of all convex supersets.)

(Lec9-10.) Convex functions from convex sets

The epigraph of a function f is the area above the curve:
$$\mathrm{epi}(f) := \left\{(x, y) \in \mathbb{R}^{d+1} : y \ge f(x)\right\}.$$
A function is convex if its epigraph is convex.

[Figure: a non-convex f and a convex f, with their epigraphs shaded.]

(Lec9-10.) Convex functions (standard definition)

A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for any $x, x' \in \mathbb{R}^d$ and $\alpha \in [0, 1]$,
$$f\left((1 - \alpha)x + \alpha x'\right) \le (1 - \alpha) f(x) + \alpha f(x').$$

[Figure: the chord from (x, f(x)) to (x′, f(x′)) dips below a non-convex curve, and lies above a convex one.]

Examples:
◮ $f(x) = cx$ for any $c > 0$ (on $\mathbb{R}$).
◮ $f(x) = |x|^c$ for any $c \ge 1$ (on $\mathbb{R}$).
◮ $f(x) = b^\top x$ for any $b \in \mathbb{R}^d$.
◮ $f(x) = \|x\|$ for any norm $\|\cdot\|$.
◮ $f(x) = x^\top A x$ for symmetric positive semidefinite $A$.
◮ $f(x) = \ln\left(\sum_{i=1}^d \exp(x_i)\right)$, which approximates $\max_i x_i$.

(Lec9-10.) Convexity of differentiable functions

Differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable, then f is convex if and only if
$$f(x) \ge f(x_0) + \nabla f(x_0)^\top (x - x_0)$$
for all $x, x_0 \in \mathbb{R}^d$. Note: this implies increasing slopes:
$$\left(\nabla f(x) - \nabla f(y)\right)^\top (x - y) \ge 0.$$

[Figure: the tangent line $a(x) = f(x_0) + f'(x_0)(x - x_0)$ lying below the graph of f.]

Twice-differentiable functions. If $f : \mathbb{R}^d \to \mathbb{R}$ is twice-differentiable, then f is convex if and only if $\nabla^2 f(x) \succeq 0$ for all $x \in \mathbb{R}^d$ (i.e., the Hessian, or matrix of second derivatives, is positive semi-definite for all x).

(Lec9-10.) Convex optimization problems

Standard form of a convex optimization problem:
$$\min_{x \in \mathbb{R}^d} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \ \text{ for all } i = 1, \ldots, n,$$
where $f_0, f_1, \ldots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex functions.

Fact: the feasible set $\mathcal{A} := \{x \in \mathbb{R}^d : f_i(x) \le 0 \text{ for all } i = 1, 2, \ldots, n\}$ is a convex set.

(SVMs next week will give us an example.)

(Lec9-10.) Subgradients: Jensen's inequality

If $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then $\mathbb{E} f(X) \ge f(\mathbb{E} X)$.

Proof. Set $y := \mathbb{E} X$, and pick any $s \in \partial f(\mathbb{E} X)$. Then
$$\mathbb{E} f(X) \ge \mathbb{E}\left[f(y) + s^\top (X - y)\right] = f(y) + s^\top\, \mathbb{E}(X - y) = f(y).$$

Note. This inequality comes up often!

(Lec11-12.) Maximum margin linear classifier

The solution ŵ to the following mathematical optimization problem:
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y\, x^\top w \ge 1 \ \text{ for all } (x, y) \in S$$
gives the linear classifier with the largest minimum margin on S, i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time. If there is a solution (i.e., S is linearly separable), then the solution is unique. We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: can also explicitly include the affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.

(Lec11-12.) SVM duality summary

Lagrangian:
$$L(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i\left(1 - y_i x_i^\top w\right).$$
The primal maximum margin problem was
$$P(w) = \max_{\alpha \ge 0} L(w, \alpha) = \max_{\alpha \ge 0}\left[\frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i\left(1 - y_i x_i^\top w\right)\right].$$
Dual problem:
$$D(\alpha) = \min_{w \in \mathbb{R}^d} L(w, \alpha) = L\left(\sum_{i=1}^n \alpha_i y_i x_i,\ \alpha\right) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\left\|\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$
Given a dual optimum $\hat{\alpha}$:
◮ the corresponding primal optimum is $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i$;
◮ strong duality: $P(\hat{w}) = D(\hat{\alpha})$;
◮ $\hat{\alpha}_i > 0$ implies $y_i x_i^\top \hat{w} = 1$, and these $y_i x_i$ are support vectors.

(Lec11-12.) Nonseparable case: bottom line

Unconstrained primal:
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \left[1 - y_i x_i^\top w\right]_+.$$
Dual:
$$\max_{\substack{\alpha \in \mathbb{R}^n \\ 0 \le \alpha_i \le C}} \left[\sum_{i=1}^n \alpha_i - \frac{1}{2}\left\|\sum_{i=1}^n \alpha_i y_i x_i\right\|^2\right].$$
The dual solution $\hat{\alpha}$ gives the primal solution $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i$.

Remarks.
◮ Can take C → ∞ to recover the separable case.
◮ The dual is a constrained convex quadratic (can be solved with projected gradient descent).
◮ Some presentations include a bias in the primal ($x_i^\top w + b$); this introduces a constraint $\sum_{i=1}^n y_i \alpha_i = 0$ in the dual.
◮ Some presentations replace $\frac{1}{2}$ and $C$ with $\frac{\lambda}{2}$ and $\frac{1}{n}$, respectively.

(Lec11-12.) Looking at the dual again

The SVM dual problem only depends on the $x_i$ through inner products $x_i^\top x_j$:
$$\max_{\alpha_1, \alpha_2, \ldots, \alpha_n \ge 0}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$
If we use a feature expansion (e.g., quadratic expansion) $x \mapsto \phi(x)$, this becomes
$$\max_{\alpha_1, \alpha_2, \ldots, \alpha_n \ge 0}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \phi(x_i)^\top \phi(x_j).$$
The solution $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i \phi(x_i)$ is used in the following way:
$$x \mapsto \phi(x)^\top \hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i\, \phi(x)^\top \phi(x_i).$$

Key insight:
◮ Training and prediction only use $\phi(x)^\top \phi(x')$, never an isolated $\phi(x)$;
◮ sometimes computing $\phi(x)^\top \phi(x')$ is much easier than computing $\phi(x)$.

(Lec11-12.) Quadratic expansion

◮ $\phi : \mathbb{R}^d \to \mathbb{R}^{1 + 2d + \binom{d}{2}}$, where
$$\phi(x) = \left(1,\ \sqrt{2}x_1, \ldots, \sqrt{2}x_d,\ x_1^2, \ldots, x_d^2,\ \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_1 x_d, \ldots, \sqrt{2}x_{d-1} x_d\right).$$
(Don't mind the $\sqrt{2}$'s. . . )
◮ Computing $\phi(x)^\top \phi(x')$ in O(d) time: $\phi(x)^\top \phi(x') = (1 + x^\top x')^2$.
◮ Much better than $d^2$ time.
◮ What if we change the exponent "2"?
◮ What if we replace the additive "1" with 0?
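A quick NumPy check of the identity φ(x)ᵀφ(x′) = (1 + xᵀx′)² (the explicit phi helper is mine, following the ordering above):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature expansion matching (1 + x.x')^2."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(4), rng.standard_normal(4)
assert np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2)
```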

(Lec11-12; RBF kernel.) Infinite dimensional feature expansion

For any σ > 0, there is an infinite feature expansion $\phi : \mathbb{R}^d \to \mathbb{R}^\infty$ such that
$$\phi(x)^\top \phi(x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
which can be computed in O(d) time. (This is called the Gaussian kernel with bandwidth σ.)
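The kernel as a function (a one-liner; the infinite φ never needs to be materialized):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian (RBF) kernel with bandwidth sigma; O(d) per evaluation."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```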

(Lec11-12.) Kernels

A (positive definite) kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function satisfying: for any $x_1, x_2, \ldots, x_n \in \mathcal{X}$, the n × n matrix whose (i, j)-th entry is $K(x_i, x_j)$ is positive semidefinite. (This matrix is called the Gram matrix.)

For any kernel K, there is a feature mapping $\phi : \mathcal{X} \to \mathcal{H}$ such that $\phi(x)^\top \phi(x') = K(x, x')$. $\mathcal{H}$ is a Hilbert space, i.e., a special kind of inner product space, called the Reproducing Kernel Hilbert Space corresponding to K.

(Lec11-12.) Kernel SVMs (Boser, Guyon, and Vapnik, 1992)

Solve
$$\max_{\alpha_1, \alpha_2, \ldots, \alpha_n \ge 0}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$
The solution $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i \phi(x_i)$ is used in the following way:
$$x \mapsto \phi(x)^\top \hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i K(x, x_i).$$
◮ To represent the classifier, need to keep the support vector examples $(x_i, y_i)$ and the corresponding $\hat{\alpha}_i$'s.
◮ To compute a prediction on x, iterate through the support vector examples and compute $K(x, x_i)$ for each support vector $x_i$ . . .

Very similar to the nearest neighbor classifier: the predictor is represented using (a subset of) the training data.
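Prediction with a kernel SVM, as a sketch (alpha_hat is assumed to come from some dual solver; kernel could be e.g. gaussian_kernel above):

```python
import numpy as np

def kernel_svm_predict(x, support_X, support_y, alpha_hat, kernel):
    """Sign of sum_i alpha_i y_i K(x, x_i) over the support vectors."""
    scores = np.array([kernel(x, xi) for xi in support_X])
    return np.sign(np.sum(alpha_hat * support_y * scores))
```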

(Lec11-12.) Nonlinear support vector machines (again)

[Figure: decision surfaces on the same 2-D dataset for a ReLU network, a quadratic SVM, an RBF SVM (σ = 1), and an RBF SVM (σ = 0.1).]

(Lec13.) The Perceptron Algorithm

Perceptron update (Rosenblatt '58): initialize w := 0, and thereafter
$$w \leftarrow w + \mathbf{1}[y\, w^\top x \le 0]\, yx.$$

Remarks.
◮ Can interpret the algorithm as: either we are correct with a margin ($y w^\top x > 0$) and we do nothing, or we are not, and we update $w \leftarrow w + yx$.
◮ Therefore: if we update, we do so by rotating towards yx.
◮ This makes sense: $(w + yx)^\top (yx) = w^\top(yx) + \|x\|^2$; i.e., we increase $w^\top(yx)$.

Scenario 1. [Figure: current vector ŵₜ comparable to xₜ in length, with yₜ = 1 and $x_t^\top \hat{w}_t \le 0$.] The updated vector ŵₜ₊₁ now correctly classifies (xₜ, yₜ).

Scenario 2. [Figure: current vector ŵₜ much longer than xₜ.] The updated vector ŵₜ₊₁ does not correctly classify (xₜ, yₜ).

It is not obvious that Perceptron will eventually terminate! (We'll return to this.)
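The update as code (a minimal sketch; it loops until a full pass makes no mistakes or an epoch cap is hit, since termination is exactly the question above):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Rosenblatt's perceptron: w <- w + 1[y w.x <= 0] y x."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # wrong (or zero margin): update
                w += yi * xi
                mistakes += 1
        if mistakes == 0:            # a perfect pass: done
            return w
    return w
```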

  • 2. Homework review

Nice homework problems

◮ HW1-P3: SVD practice.
◮ HW2-P1: SVD practice; definitions of the norms $\|A\|_2$ and $\|A\|_F$.
◮ HW3-P1: convexity practice.
◮ HW3-P2: deep network and probability practice.
◮ HW3-P4, HW3-P5: kernel practice.

  • 3. Stuff added 3/12/2019

Belaboring Cauchy-Schwarz

Theorem (Cauchy-Schwarz). $|a^\top b| \le \|a\| \cdot \|b\|$.

Proof (another one. . . ). If either a or b is zero, then $a^\top b = 0 = \|a\| \cdot \|b\|$. If both a and b are nonzero, note for any r > 0 that
$$0 \le \left\|ra - \frac{b}{r}\right\|^2 = r^2\|a\|^2 - 2a^\top b + \frac{\|b\|^2}{r^2},$$
which can be rearranged into
$$a^\top b \le \frac{r^2\|a\|^2}{2} + \frac{\|b\|^2}{2r^2}.$$
Choosing $r = \sqrt{\|b\|/\|a\|}$,
$$a^\top b \le \frac{\|a\|^2}{2} \cdot \frac{\|b\|}{\|a\|} + \frac{\|b\|^2}{2} \cdot \frac{\|a\|}{\|b\|} = \|a\| \cdot \|b\|.$$
Lastly, applying the bound to $(a, -b)$ gives
$$-a^\top b = a^\top(-b) \le \|a\| \cdot \|-b\| = \|a\| \cdot \|b\|,$$
so together it follows that $|a^\top b| \le \|a\| \cdot \|b\|$. ∎

SVD again

  • 1. SV triples: (s, u, v) satisfies $Mv = su$ and $M^\top u = sv$.
  • 2. Thin decomposition SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$.
  • 3. Full factorization SVD: $M = USV^\top$.
  • 4. "Operational" view of the SVD: for $M \in \mathbb{R}^{n \times d}$,
$$M = \begin{pmatrix} \uparrow & & \uparrow & \uparrow & & \uparrow \\ u_1 & \cdots & u_r & u_{r+1} & \cdots & u_n \\ \downarrow & & \downarrow & \downarrow & & \downarrow \end{pmatrix} \cdot \begin{pmatrix} s_1 & & & \\ & \ddots & & \\ & & s_r & \\ & & & 0 \end{pmatrix} \cdot \begin{pmatrix} \uparrow & & \uparrow & \uparrow & & \uparrow \\ v_1 & \cdots & v_r & v_{r+1} & \cdots & v_d \\ \downarrow & & \downarrow & \downarrow & & \downarrow \end{pmatrix}^\top.$$
The first parts of U, V span the column / row space (respectively); the second parts span the left / right nullspaces (respectively).

Personally: I internalize the SVD and use it to reason about matrices. E.g., the "rank-nullity theorem".
