

SLIDE 1

Random Projections, Margins, Kernels and Feature Selection

Adithya Pediredla, Rice University, Electrical and Computer Engineering

SLIDE 2

SVM: Revision

f(x_i) = w^T x_i + b

Primal: min_{w ∈ R^d} ||w||² + C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i));   cost O(nd² + d³)

Dual: max_{α_i ≥ 0} Σ_i α_i − (1/2) Σ_{j,k} α_j α_k y_j y_k (x_j^T x_k);   cost O(dn² + n³)

s.t. 0 ≤ α_i ≤ C ∀i; Σ_i α_i y_i = 0

Only inner products matter.
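Because the dual touches the data only through the inner products x_j^T x_k, an SVM can be trained from the Gram matrix alone. Below is a minimal sketch using scikit-learn's precomputed-kernel mode; the library choice and the toy data are illustrative assumptions, not part of the slides.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: n points in d dimensions with +/-1 labels.
n, d = 200, 50
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))

# The dual only needs the n x n Gram matrix of inner products.
G = X @ X.T

clf = SVC(C=1.0, kernel="precomputed")
clf.fit(G, y)                      # trained from inner products alone

# Prediction only needs inner products of new points with the training points
# (an n_test x n matrix).
X_test = rng.standard_normal((10, d))
print(clf.predict(X_test @ X.T))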

SLIDE 3

Decreasing computations

Only inner products matter.

Can we approximate x_i with z_i so that dim(z_i) << dim(x_i) and x_i^T x_j ≈ z_i^T z_j?

One way: z_i = A x_i. Any comment on the rows vs. columns of A?

Turns out a random A is good!
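A quick numerical sanity check of this claim; the Gaussian choice of A, the 1/√d scaling, and the toy dimensions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

D, d, n = 10_000, 500, 20            # original dim, reduced dim, number of points
X = rng.standard_normal((n, D))

# z_i = A x_i with Gaussian A; the 1/sqrt(d) scaling makes z_i^T z_j unbiased for x_i^T x_j.
A = rng.standard_normal((d, D)) / np.sqrt(d)
Z = X @ A.T

orig = X @ X.T                       # x_i^T x_j for all pairs
proj = Z @ Z.T                       # z_i^T z_j for all pairs
norms = np.linalg.norm(X, axis=1)

# Additive error, measured relative to the scale ||x_i|| ||x_j||, is on the order of 1/sqrt(d).
err = np.abs(proj - orig) / np.outer(norms, norms)
print("max scaled error:", err.max())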

SLIDE 4

Johnson-Lindenstrauss Lemma

If d_new = Ω((1/γ²) log n), relative angles are preserved up to 1 ± γ.

How big can γ be?
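For a rough sense of scale (numbers chosen purely for illustration): with n = 10,000 points and distortion γ = 0.1, the bound calls for on the order of (1/γ²)·ln n = 100 × ln(10⁴) ≈ 920 dimensions, independent of the original dimension.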

SLIDE 5

Which data set can have a higher γ?

[Figure: four scatter plots of candidate data sets, each on 0–20 × 0–20 axes.]

SLIDE 6

Which data set can have a higher γ?

[Figure: six scatter plots of candidate data sets, each on 0–20 × 0–20 axes.]

SLIDE 7

Which data set can have a higher γ?

[Figure: twelve scatter plots of candidate data sets, each on 0–20 × 0–20 axes.]

SLIDE 8

How else can a big margin help?

A simple weak learner whose speed is proportional to the margin:
Step 1: Pick a random hypothesis h.
Step 2: Evaluate the error of h. If error < 1/2 − γ/4, stop; else, go to Step 1.

The bigger the margin, the fewer the iterations.
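A sketch of this weak learner for hypotheses that are random unit vectors; the hypothesis class, the toy data, and the margin value are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def weak_learn(X, y, gamma, max_iter=100_000):
    """Repeatedly pick a random unit vector h until its error beats 1/2 - gamma/4."""
    n, d = X.shape
    for it in range(1, max_iter + 1):
        h = rng.standard_normal(d)
        h /= np.linalg.norm(h)                 # step 1: random hypothesis
        err = np.mean(np.sign(X @ h) != y)     # step 2: its 0/1 error
        if err < 0.5 - gamma / 4:              # good enough to stop
            return h, err, it
    raise RuntimeError("no weak hypothesis found")

# Toy linearly separable data.
d, n, gamma = 20, 500, 0.2
w_true = np.ones(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true)

h, err, iters = weak_learn(X, y, gamma)
print(f"error {err:.3f} after {iters} random draws")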

SLIDE 9

Dimensionality reduction: random projection

Coming back to random projection: A_{d×D}.

1. Choose the rows of A to be d random orthogonal unit-length vectors.
2. Choose each entry of A independently from a standard Gaussian.
3. Choose each entry of A to be +1 or −1 independently at random.

For (2) and (3), with u′, v′ the projections of u, v:
Pr_A[(1 − γ)||u − v||² ≤ ||u′ − v′||² ≤ (1 + γ)||u − v||²] ≥ 1 − 2e^{−(γ² − γ³)d/4}

Can we do better?
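A sketch of the three constructions, plus a check of the distance-preservation claim for (2); the dimensions and the 1/√d scaling of the projection are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
D, d = 2_000, 200

# (1) rows of A are d random orthonormal vectors in R^D
M = rng.standard_normal((D, d))
Q, _ = np.linalg.qr(M)              # Q has d orthonormal columns of length D
A1 = Q.T

# (2) i.i.d. standard Gaussian entries
A2 = rng.standard_normal((d, D))

# (3) i.i.d. +1/-1 entries
A3 = rng.choice([-1.0, 1.0], size=(d, D))

# Distance preservation for (2): ||u' - v'||^2 vs ||u - v||^2,
# with the conventional 1/sqrt(d) scaling of the projection.
u, v = rng.standard_normal(D), rng.standard_normal(D)
up, vp = A2 @ u / np.sqrt(d), A2 @ v / np.sqrt(d)
print(np.linalg.norm(up - vp) ** 2 / np.linalg.norm(u - v) ** 2)   # close to 1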

SLIDE 10

Can we do better?

If we are willing to accept error ε with failure probability δ (i.e., Pr(error > ε) < δ), then d = O((1/γ²) log(1/(εδ))) is sufficient.
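For a rough sense of scale (numbers chosen purely for illustration): with γ = 0.05 and ε = δ = 0.01, this gives d = O(400 · ln(10⁴)) ≈ O(3,700), and the logarithm depends only on ε and δ, not on the number of data points.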

SLIDE 11

Kernel functions

What if we know that K(x_1, x_2) = φ(x_1)^T φ(x_2)? What if we do not?

Finding inner products approximately is enough.

We need to know the distribution of the data set.
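For example, for the Gaussian RBF kernel K(x_1, x_2) = exp(−||x_1 − x_2||²/(2σ²)) the corresponding φ is infinite-dimensional, so even when K is known in closed form, φ(x) cannot be written out explicitly.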

SLIDE 12

Mapping-1

Lemma: Consider any distribution over labelled data. Assume ∃ w such that Pr[y(w · x) < γ] = 0, i.e., the data is separable with margin γ. If we draw z_1, z_2, ..., z_d i.i.d. from this distribution with d ≥ (8/ε)[1/γ² + ln(1/δ)], then with probability ≥ 1 − δ, ∃ w′ ∈ span(z_1, z_2, ..., z_d) such that Pr[y(w′ · x) < γ/2] < ε.

Therefore, if such a w exists in φ-space, by sampling x_1, x_2, ..., x_d we are guaranteed: w′ = α_1 φ(x_1) + α_2 φ(x_2) + ··· + α_d φ(x_d). Hence, w′ · φ(x) = α_1 K(x, x_1) + α_2 K(x, x_2) + ··· + α_d K(x, x_d).

If we define F_1(x) = (K(x, x_1), ..., K(x, x_d)), then with high probability the vector (α_1, ..., α_d) is an approximate linear separator.
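A sketch of Mapping-1 with an RBF kernel, landmark points drawn from the data, and a linear SVM trained on F_1(x); the kernel, the toy data, and d = 100 landmarks are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def rbf(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for all pairs."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Toy non-linear problem: label is +1 inside a ball, -1 outside.
n = 1_000
X = rng.standard_normal((n, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# Draw d landmarks from the data and map x -> F1(x) = (K(x, x_1), ..., K(x, x_d)).
d = 100
landmarks = X[rng.choice(n, size=d, replace=False)]
F1 = rbf(X, landmarks)

clf = LinearSVC().fit(F1, y)         # linear separator in the F1 space
print("training accuracy:", clf.score(F1, y))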

SLIDE 13

Mapping-2

We can normalize K(x, x_i) and get better bounds.

Compute K = U^T U; compute F_2(x) = F_1(x) U^{−1}.

F_2 is linearly separable with error at most ε at margin γ/2.
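A sketch of Mapping-2 on the same illustrative setup, realizing K = U^T U with a Cholesky factorization of the landmark kernel matrix (that choice, and the small jitter for numerical stability, are my assumptions).

import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

X = rng.standard_normal((1_000, 2))                      # toy data
landmarks = X[rng.choice(len(X), size=100, replace=False)]

F1 = rbf(X, landmarks)                                   # Mapping-1 features
K = rbf(landmarks, landmarks)                            # d x d landmark kernel matrix

# K = U^T U via Cholesky (K = L L^T with U = L^T).
jitter = 1e-8 * np.eye(len(K))
L = np.linalg.cholesky(K + jitter)
F2 = np.linalg.solve(L, F1.T).T                          # F2 = F1 U^{-1}, no explicit inverse

# Sanity check: F2(x) . F2(z) = F1(x) K^{-1} F1(z)^T, i.e. the kernel of the data
# projected onto the span of the landmarks' feature vectors.
print(np.allclose(F2 @ F2.T, F1 @ np.linalg.solve(K + jitter, F1.T)))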

SLIDE 14

Key takeaways

  • Inner products are enough.
  • Random projections are good.
  • The higher the margin, the lower the dimension.
  • If we are okay with some error, we can project to a much lower dimension.
  • While using kernels, randomly drawn data points act as good features.