SLIDE 1

Natural Language Processing Spring 2017

Liang Huang liang.huang.sh@gmail.com

Unit 2: Natural Language Learning

Unsupervised Learning (EM, forward-backward, inside-outside)

SLIDE 2

CS 562 - EM

Review of Noisy-Channel Model


SLIDE 3

Example 1: Part-of-Speech Tagging

  • use a tag bigram model as the language model
  • the channel model is context-independent

SLIDE 4

Ideal vs. Available Data

(figure contrasting two columns of data: ideal vs. available)

SLIDE 5

Ideal vs. Available Data

HW2: ideal    HW4: realistic

ideal (pronunciation pairs with alignments):
  EY B AH L -> A B E R U    (1 2 3 4 4)
  AH B AW T -> A B A U T O  (1 2 3 3 4 4)
  AH L ER T -> A R A A T O  (1 2 3 3 4 4)
  EY S      -> E E S U      (1 1 2 2)

available (pairs only, no alignments):
  EY B AH L -> A B E R U
  AH B AW T -> A B A U T O
  AH L ER T -> A R A A T O
  EY S      -> E E S U

SLIDE 6

Incomplete Data / Model


SLIDE 7

EM: Expectation-Maximization


SLIDE 8

How to Change m? 1) Hard


SLIDE 9

How to Change m? 1) Hard


SLIDE 10

How to Change m? 2) Soft


SLIDE 11

Fractional Counts

  • distribution over all possible hallucinated hidden variables for W AY N / W A I N

the three possible alignments:
  z1: W -> W,   AY -> A,   N -> I N
  z2: W -> W,   AY -> A I, N -> N
  z3: W -> W A, AY -> I,   N -> N

hard-EM counts:         z1: 1      z2: 0      z3: 0

fractional counts:      z1: 0.333  z2: 0.333  z3: 0.333
model (count-n-divide):
  AY -> A: 0.333, A I: 0.333, I: 0.333
  W  -> W: 0.667, W A: 0.333
  N  -> N: 0.667, I N: 0.333

regenerate:             z1: 2/3*1/3*1/3  z2: 2/3*1/3*2/3  z3: 1/3*1/3*2/3
fractional counts:      z1: 0.25   z2: 0.5    z3: 0.25
model (count-n-divide):
  AY -> A I: 0.500, A: 0.250, I: 0.250
  W  -> W: 0.750, W A: 0.250
  N  -> N: 0.750, I N: 0.250

eventually ...          z1: 0      z2: 1      z3: 0

SLIDE 12

Is EM magic? well, sort of...

  • how about W EH T / W E T O, or B IY / B I I (which can align as B -> B, IY -> I I or as B -> B I, IY -> I)?
  • so EM can possibly: (1) learn something correct, (2) learn something wrong, or (3) learn nothing
  • but with lots of data => likely to learn something good

SLIDE 13

EM: slow version (non-DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • for each hidden z: compute p(x, z) from the current model
              • p(x) = sum_z p(x, z); [debug: corpus prob p(data) *= p(x)]
              • for each hidden z = (z1 z2 ... zn): for each i:
                  • #(zi) += p(x, z) / p(x); #(LHS(zi)) += p(x, z) / p(x)
      • M-step: count-n-divide on fraccounts => new model
          • p(RHS(zi) | LHS(zi)) = #(zi) / #(LHS(zi))

(figure: the three alignments z, z', z'' of W AY N to W A I N; each z is a sequence of rules (z1 z2 z3), e.g. p(A I | AY) = #(AY -> A I) / #(AY))

SLIDE 14

EM: slow version (non-DP)

  • distribution over all possible hallucinated hidden variables for W AY N / W A I N

the three possible alignments:
  z1: W -> W,   AY -> A,   N -> I N
  z2: W -> W,   AY -> A I, N -> N
  z3: W -> W A, AY -> I,   N -> N

fractional counts:      z1: 1/3   z2: 1/3   z3: 1/3
model (count-n-divide):
  AY -> A: 0.333, A I: 0.333, I: 0.333
  W  -> W: 0.667, W A: 0.333
  N  -> N: 0.667, I N: 0.333

regenerate p(x, z):     z1: 2/3*1/3*1/3  z2: 2/3*1/3*2/3  z3: 1/3*1/3*2/3
renormalize by p(x) = 2/27 + 4/27 + 2/27 = 8/27
fractional counts:      z1: 1/4   z2: 1/2   z3: 1/4
model (count-n-divide):
  AY -> A I: 0.500, A: 0.250, I: 0.250
  W  -> W: 0.750, W A: 0.250
  N  -> N: 0.750, I N: 0.250

regenerate p(x, z):     z1: 3/4*1/4*1/4  z2: 3/4*1/2*3/4  z3: 1/4*1/4*3/4
renormalize by p(x) = 3/64 + 18/64 + 3/64 = 3/8
fractional counts:      z1: 1/8   z2: 3/4   z3: 1/8
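The two iterations above can be reproduced by brute-force enumeration of the three alignments, which is exactly the slow (non-DP) E-step. A minimal sketch; the `alignments` encoding and helper names are mine, and `fractions.Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

# the three alignments of eprons "W AY N" to jprons "W A I N",
# each written as a list of rules (epron, jpron segment) -- my encoding
alignments = [
    [("W", "W"),   ("AY", "A"),   ("N", "I N")],   # z1
    [("W", "W"),   ("AY", "A I"), ("N", "N")],     # z2
    [("W", "W A"), ("AY", "I"),   ("N", "N")],     # z3
]

def m_step(fracs):
    """Count-n-divide: fractional counts per alignment -> cond. prob table."""
    counts = {}
    for frac, z in zip(fracs, alignments):
        for e, j in z:
            row = counts.setdefault(e, {})
            row[j] = row.get(j, Fraction(0)) + frac
    return {e: {j: c / sum(row.values()) for j, c in row.items()}
            for e, row in counts.items()}

def e_step(model):
    """Regenerate p(x, z) for each alignment, then renormalize by p(x)."""
    joint = []
    for z in alignments:
        p = Fraction(1)
        for e, j in z:
            p *= model[e][j]
        joint.append(p)
    px = sum(joint)                    # p(x) = sum_z p(x, z)
    return [p / px for p in joint]

fracs = [Fraction(1, 3)] * 3           # start from uniform fractional counts
for it in range(2):
    fracs = e_step(m_step(fracs))
print(fracs)    # [Fraction(1, 8), Fraction(3, 4), Fraction(1, 8)]
```

Iterating further drives the counts toward (0, 1, 0), matching the "eventually" row on the earlier slide.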

SLIDE 15

EM: fast version (DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • forward from s to t; note: forw[t] = p(x) = sum_z p(x, z)
              • backward from t to s; note: back[t] = 1; back[s] = forw[t]
              • for each edge (u, v) in the DP graph with label(u, v) = zi:
                  • fraccount(zi) += forw[u] * back[v] * prob(u, v) / p(x)
      • M-step: count-n-divide on fraccounts => new model

(figure: a path from s through edge (u, v) to t; forw[u] * prob(u, v) * back[v] = sum of p(x, z) over all z that use edge (u, v); forw[t] = back[s] = p(x) = sum_z p(x, z))

SLIDE 16

How to avoid enumeration?

  • dynamic programming: the forward-backward algorithm
  • forward is just like Viterbi, replacing max by sum
  • backward is like reverse Viterbi (also with sum)

(annotations: forward-backward covers POS tagging, crypto, ..., alignment, edit-distance, ...; inside-outside extends it to PCFG, SCFG, ...)
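The max-vs-sum point can be checked on the running W AY N example: under the uniform-count model, the three joint probabilities p(x, z) are 2/27, 4/27, 2/27 (slide 14's numbers), so Viterbi and forward differ only in the final combination:

```python
from fractions import Fraction as F

# joint probabilities p(x, z) of the three W AY N / W A I N alignments
# under the uniform-count model from the earlier slides
joint = [F(2, 27), F(4, 27), F(2, 27)]

best = max(joint)    # Viterbi-style: score of the best single alignment
total = sum(joint)   # forward-style: p(x), summed over all alignments
print(best, total)   # 4/27 8/27
```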

SLIDE 17

Example Forward Code

  • for HW5. this example shows forward only.

n, m = len(eprons), len(jprons)
forward[0][0] = 1
for i in xrange(0, n):
    epron = eprons[i]
    for j in forward[i]:
        for k in range(1, min(m-j, 3)+1):
            jseg = tuple(jprons[j:j+k])
            score = forward[i][j] * table[epron][jseg]
            forward[i+1][j+k] += score
totalprob *= forward[n][m]

(lattice figure: jprons W A I N at positions 1-4, eprons W AY N at positions 1-3)
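The forward snippet above is a fragment: it assumes `forward`, `table`, and `totalprob` already exist. A self-contained Python 3 sketch on the W AY N example; the model values are an assumption (the first-iteration table from the earlier slides), and the membership check guards segments the model cannot emit:

```python
from collections import defaultdict
from fractions import Fraction as F

eprons = ["W", "AY", "N"]
jprons = ["W", "A", "I", "N"]

# assumed channel model p(jseg | epron): the slides' first-iteration table
table = {
    "W":  {("W",): F(2, 3), ("W", "A"): F(1, 3)},
    "AY": {("A",): F(1, 3), ("A", "I"): F(1, 3), ("I",): F(1, 3)},
    "N":  {("N",): F(2, 3), ("I", "N"): F(1, 3)},
}

n, m = len(eprons), len(jprons)
forward = defaultdict(lambda: defaultdict(F))
forward[0][0] = F(1)
for i in range(n):
    epron = eprons[i]
    for j in list(forward[i]):                # snapshot; we only write row i+1
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[epron]:          # skip segments the model can't emit
                forward[i+1][j+k] += forward[i][j] * table[epron][jseg]

print(forward[n][m])    # p(x) = sum_z p(x, z) = 8/27
```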

SLIDE 18

Example Forward Code

  • for HW5. this example shows forward only.

n, m = len(eprons), len(jprons)
forward[0][0] = 1
for i in xrange(0, n):
    epron = eprons[i]
    for j in forward[i]:
        for k in range(1, min(m-j, 3)+1):
            jseg = tuple(jprons[j:j+k])
            score = forward[i][j] * table[epron][jseg]
            forward[i+1][j+k] += score
totalprob *= forward[n][m]

(figure: the edge for a rule like AY -> A I goes from node (i, j) to node (i+1, j+k); its fractional count combines forw[i][j] and back[i+1][j+k]; forw[s] = back[t] = 1.0, forw[t] = back[s] = p(x))

SLIDE 19

EM: fast version (DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • forward from s to t; note: forw[t] = p(x) = sum_z p(x, z)
              • backward from t to s; note: back[t] = 1; back[s] = forw[t]
              • for each edge (u, v) in the DP graph with label(u, v) = zi:
                  • fraccount(zi) += forw[u] * back[v] * prob(u, v) / p(x)
      • M-step: count-n-divide on fraccounts => new model

(figure: a path from s through edge (u, v) to t; forw[u] * prob(u, v) * back[v] = sum of p(x, z) over all z that use edge (u, v); forw[t] = back[s] = p(x) = sum_z p(x, z))
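The fast E-step can be sketched end to end on the W AY N lattice. The table and all variable names here are illustrative assumptions; backward mirrors forward, and each edge contributes forw[u] * prob(u, v) * back[v] / p(x) without ever enumerating alignments:

```python
from collections import defaultdict
from fractions import Fraction as F

eprons = ["W", "AY", "N"]
jprons = ["W", "A", "I", "N"]

# assumed channel model p(jseg | epron): the slides' first-iteration table
table = {
    "W":  {("W",): F(2, 3), ("W", "A"): F(1, 3)},
    "AY": {("A",): F(1, 3), ("A", "I"): F(1, 3), ("I",): F(1, 3)},
    "N":  {("N",): F(2, 3), ("I", "N"): F(1, 3)},
}
n, m = len(eprons), len(jprons)

forw = defaultdict(lambda: defaultdict(F))
back = defaultdict(lambda: defaultdict(F))
forw[0][0] = F(1)            # forw[s] = 1
back[n][m] = F(1)            # back[t] = 1

for i in range(n):           # forward, from s to t
    for j in list(forw[i]):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]]:
                forw[i+1][j+k] += forw[i][j] * table[eprons[i]][jseg]

for i in reversed(range(n)): # backward, from t to s
    for j in range(m + 1):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]] and (j + k) in back[i+1]:
                back[i][j] += table[eprons[i]][jseg] * back[i+1][j+k]

px = forw[n][m]              # == back[0][0] == p(x)

fraccount = defaultdict(lambda: defaultdict(F))
for i in range(n):           # E-step: one pass over edges
    for j in list(forw[i]):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]]:
                fraccount[eprons[i]][jseg] += (
                    forw[i][j] * table[eprons[i]][jseg] * back[i+1][j+k] / px)

print(px, fraccount["AY"][("A", "I")])    # 8/27 1/2
```

The resulting count for AY -> A I is 1/2, matching the slow enumeration version.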

SLIDE 20

EM


SLIDE 21

Why does EM increase p(data) iteratively?


SLIDE 22

Why does EM increase p(data) iteratively?

(figure: log p(data) is lower-bounded by a convex auxiliary function; the gap is a KL-divergence; maximizing the auxiliary at each iteration converges to a local maximum)
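The picture corresponds to the standard lower-bound argument; a sketch in common notation (q is any distribution over the hidden z):

```latex
\log p(x;\theta)
  \;=\; \underbrace{\sum_z q(z)\,\log\frac{p(x,z;\theta)}{q(z)}}_{\text{auxiliary function}}
  \;+\; \underbrace{\mathrm{KL}\bigl(q(z)\,\big\|\,p(z\mid x;\theta)\bigr)}_{\ge 0}
```

Setting q(z) = p(z | x; theta_t) in the E-step makes the KL term zero at theta_t, so the auxiliary touches log p(x; theta) there; maximizing the auxiliary over theta in the M-step therefore cannot decrease log p(x; theta).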

SLIDE 23

How to maximize the auxiliary?

posteriors over the three alignments of W AY N to W A I N:
  z   (W -> W,   AY -> A,   N -> I N):  p(z | x)   = 0.5
  z'  (W -> W,   AY -> A I, N -> N):    p(z' | x)  = 0.3
  z'' (W -> W A, AY -> I,   N -> N):    p(z'' | x) = 0.2

just count-n-divide on the fractional data!

(as if MLE on complete data with 5x z, 3x z', and 2x z'')
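Count-n-divide on fractional data is just weighted MLE. A small check; the encoding of the three alignments is mine, with the posteriors 0.5, 0.3, 0.2 written as exact fractions:

```python
from collections import defaultdict
from fractions import Fraction as F

# posterior-weighted alignments: p(z|x)=0.5, p(z'|x)=0.3, p(z''|x)=0.2
weighted = [
    (F(5, 10), [("W", "W"),   ("AY", "A"),   ("N", "I N")]),   # z
    (F(3, 10), [("W", "W"),   ("AY", "A I"), ("N", "N")]),     # z'
    (F(2, 10), [("W", "W A"), ("AY", "I"),   ("N", "N")]),     # z''
]

counts = defaultdict(lambda: defaultdict(F))
for w, z in weighted:
    for e, j in z:
        counts[e][j] += w                      # count: accumulate fractional counts

# divide: normalize each left-hand side into a conditional prob table
model = {e: {j: c / sum(row.values()) for j, c in row.items()}
         for e, row in counts.items()}
print(model["AY"]["A I"])   # 3/10
```

This is exactly MLE as if the corpus contained 5 copies of z, 3 of z', and 2 of z''.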