SLIDE 1

Natural Language Processing Spring 2017

Liang Huang liang.huang.sh@gmail.com

Unit 2: Natural Language Learning

Unsupervised Learning (EM, forward-backward, inside-outside)

SLIDE 2

CS 562 - EM

Review of Noisy-Channel Model


SLIDE 3

Example 1: Part-of-Speech Tagging

  • use a tag bigram model as the language model
  • the channel model is context-independent

SLIDE 4

Ideal vs. Available Data

(figure contrasting two columns of data: ideal vs. available)

SLIDE 5

Ideal vs. Available Data

HW2: ideal    HW4: realistic

ideal (pronunciation pairs with alignments):
  EY B AH L -> A B E R U    (1 2 3 4 4)
  AH B AW T -> A B A U T O  (1 2 3 3 4 4)
  AH L ER T -> A R A A T O  (1 2 3 3 4 4)
  EY S      -> E E S U      (1 1 2 2)

available (pairs only, no alignments):
  EY B AH L -> A B E R U
  AH B AW T -> A B A U T O
  AH L ER T -> A R A A T O
  EY S      -> E E S U

SLIDE 6

Incomplete Data / Model


SLIDE 7

EM: Expectation-Maximization


SLIDE 8

How to Change m? 1) Hard


SLIDE 9

How to Change m? 1) Hard


SLIDE 10

How to Change m? 2) Soft


SLIDE 11

Fractional Counts

  • distribution over all possible hallucinated hidden variables for W AY N / W A I N

the three possible alignments:
  z1: W -> W,   AY -> A,   N -> I N
  z2: W -> W,   AY -> A I, N -> N
  z3: W -> W A, AY -> I,   N -> N

hard-EM counts:         z1: 1      z2: 0      z3: 0

fractional counts:      z1: 0.333  z2: 0.333  z3: 0.333
model (count-n-divide):
  AY -> A: 0.333, A I: 0.333, I: 0.333
  W  -> W: 0.667, W A: 0.333
  N  -> N: 0.667, I N: 0.333

regenerate:             z1: 2/3*1/3*1/3  z2: 2/3*1/3*2/3  z3: 1/3*1/3*2/3
fractional counts:      z1: 0.25   z2: 0.5    z3: 0.25
model (count-n-divide):
  AY -> A I: 0.500, A: 0.250, I: 0.250
  W  -> W: 0.750, W A: 0.250
  N  -> N: 0.750, I N: 0.250

eventually ...          z1: 0      z2: 1      z3: 0

SLIDE 12

Is EM magic? well, sort of...

  • how about W EH T / W E T O, or B IY / B I I (which can align as B -> B, IY -> I I or as B -> B I, IY -> I)?
  • so EM can possibly: (1) learn something correct, (2) learn something wrong, or (3) learn nothing
  • but with lots of data => likely to learn something good

SLIDE 13

EM: slow version (non-DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • for each hidden z: compute p(x, z) from the current model
              • p(x) = sum_z p(x, z); [debug: corpus prob p(data) *= p(x)]
              • for each hidden z = (z1 z2 ... zn): for each i:
                  • #(zi) += p(x, z) / p(x); #(LHS(zi)) += p(x, z) / p(x)
      • M-step: count-n-divide on fraccounts => new model
          • p(RHS(zi) | LHS(zi)) = #(zi) / #(LHS(zi))

(figure: the three alignments z, z', z'' of W AY N to W A I N; each z is a sequence of rules (z1 z2 z3), e.g. p(A I | AY) = #(AY -> A I) / #(AY))

SLIDE 14

EM: slow version (non-DP)

  • distribution over all possible hallucinated hidden variables for W AY N / W A I N

the three possible alignments:
  z1: W -> W,   AY -> A,   N -> I N
  z2: W -> W,   AY -> A I, N -> N
  z3: W -> W A, AY -> I,   N -> N

fractional counts:      z1: 1/3   z2: 1/3   z3: 1/3
model (count-n-divide):
  AY -> A: 0.333, A I: 0.333, I: 0.333
  W  -> W: 0.667, W A: 0.333
  N  -> N: 0.667, I N: 0.333

regenerate p(x, z):     z1: 2/3*1/3*1/3  z2: 2/3*1/3*2/3  z3: 1/3*1/3*2/3
renormalize by p(x) = 2/27 + 4/27 + 2/27 = 8/27
fractional counts:      z1: 1/4   z2: 1/2   z3: 1/4
model (count-n-divide):
  AY -> A I: 0.500, A: 0.250, I: 0.250
  W  -> W: 0.750, W A: 0.250
  N  -> N: 0.750, I N: 0.250

regenerate p(x, z):     z1: 3/4*1/4*1/4  z2: 3/4*1/2*3/4  z3: 1/4*1/4*3/4
renormalize by p(x) = 3/64 + 18/64 + 3/64 = 3/8
fractional counts:      z1: 1/8   z2: 3/4   z3: 1/8
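The two iterations above can be reproduced by brute-force enumeration of the three alignments, which is exactly the slow (non-DP) E-step. A minimal sketch; the `alignments` encoding and helper names are mine, and `fractions.Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

# the three alignments of eprons "W AY N" to jprons "W A I N",
# each written as a list of rules (epron, jpron segment) -- my encoding
alignments = [
    [("W", "W"),   ("AY", "A"),   ("N", "I N")],   # z1
    [("W", "W"),   ("AY", "A I"), ("N", "N")],     # z2
    [("W", "W A"), ("AY", "I"),   ("N", "N")],     # z3
]

def m_step(fracs):
    """Count-n-divide: fractional counts per alignment -> cond. prob table."""
    counts = {}
    for frac, z in zip(fracs, alignments):
        for e, j in z:
            row = counts.setdefault(e, {})
            row[j] = row.get(j, Fraction(0)) + frac
    return {e: {j: c / sum(row.values()) for j, c in row.items()}
            for e, row in counts.items()}

def e_step(model):
    """Regenerate p(x, z) for each alignment, then renormalize by p(x)."""
    joint = []
    for z in alignments:
        p = Fraction(1)
        for e, j in z:
            p *= model[e][j]
        joint.append(p)
    px = sum(joint)                    # p(x) = sum_z p(x, z)
    return [p / px for p in joint]

fracs = [Fraction(1, 3)] * 3           # start from uniform fractional counts
for it in range(2):
    fracs = e_step(m_step(fracs))
print(fracs)    # [Fraction(1, 8), Fraction(3, 4), Fraction(1, 8)]
```

Iterating further drives the counts toward (0, 1, 0), matching the "eventually" row on the earlier slide.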

SLIDE 15

EM: fast version (DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • forward from s to t; note: forw[t] = p(x) = sum_z p(x, z)
              • backward from t to s; note: back[t] = 1; back[s] = forw[t]
              • for each edge (u, v) in the DP graph with label(u, v) = zi:
                  • fraccount(zi) += forw[u] * back[v] * prob(u, v) / p(x)
      • M-step: count-n-divide on fraccounts => new model

(figure: a path from s through edge (u, v) to t; forw[u] * prob(u, v) * back[v] = sum of p(x, z) over all z that use edge (u, v); forw[t] = back[s] = p(x) = sum_z p(x, z))

SLIDE 16

How to avoid enumeration?

  • dynamic programming: the forward-backward algorithm
  • forward is just like Viterbi, replacing max by sum
  • backward is like reverse Viterbi (also with sum)

(annotations: forward-backward covers POS tagging, crypto, ..., alignment, edit-distance, ...; inside-outside extends it to PCFG, SCFG, ...)
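The max-vs-sum point can be checked on the running W AY N example: under the uniform-count model, the three joint probabilities p(x, z) are 2/27, 4/27, 2/27 (slide 14's numbers), so Viterbi and forward differ only in the final combination:

```python
from fractions import Fraction as F

# joint probabilities p(x, z) of the three W AY N / W A I N alignments
# under the uniform-count model from the earlier slides
joint = [F(2, 27), F(4, 27), F(2, 27)]

best = max(joint)    # Viterbi-style: score of the best single alignment
total = sum(joint)   # forward-style: p(x), summed over all alignments
print(best, total)   # 4/27 8/27
```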

SLIDE 17

Example Forward Code

  • for HW5. this example shows forward only.

n, m = len(eprons), len(jprons)
forward[0][0] = 1
for i in xrange(0, n):
    epron = eprons[i]
    for j in forward[i]:
        for k in range(1, min(m-j, 3)+1):
            jseg = tuple(jprons[j:j+k])
            score = forward[i][j] * table[epron][jseg]
            forward[i+1][j+k] += score
totalprob *= forward[n][m]

(lattice figure: jprons W A I N at positions 1-4, eprons W AY N at positions 1-3)
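The forward snippet above is a fragment: it assumes `forward`, `table`, and `totalprob` already exist. A self-contained Python 3 sketch on the W AY N example; the model values are an assumption (the first-iteration table from the earlier slides), and the membership check guards segments the model cannot emit:

```python
from collections import defaultdict
from fractions import Fraction as F

eprons = ["W", "AY", "N"]
jprons = ["W", "A", "I", "N"]

# assumed channel model p(jseg | epron): the slides' first-iteration table
table = {
    "W":  {("W",): F(2, 3), ("W", "A"): F(1, 3)},
    "AY": {("A",): F(1, 3), ("A", "I"): F(1, 3), ("I",): F(1, 3)},
    "N":  {("N",): F(2, 3), ("I", "N"): F(1, 3)},
}

n, m = len(eprons), len(jprons)
forward = defaultdict(lambda: defaultdict(F))
forward[0][0] = F(1)
for i in range(n):
    epron = eprons[i]
    for j in list(forward[i]):                # snapshot; we only write row i+1
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[epron]:          # skip segments the model can't emit
                forward[i+1][j+k] += forward[i][j] * table[epron][jseg]

print(forward[n][m])    # p(x) = sum_z p(x, z) = 8/27
```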

SLIDE 18

Example Forward Code

  • for HW5. this example shows forward only.

n, m = len(eprons), len(jprons)
forward[0][0] = 1
for i in xrange(0, n):
    epron = eprons[i]
    for j in forward[i]:
        for k in range(1, min(m-j, 3)+1):
            jseg = tuple(jprons[j:j+k])
            score = forward[i][j] * table[epron][jseg]
            forward[i+1][j+k] += score
totalprob *= forward[n][m]

(figure: the edge for a rule like AY -> A I goes from node (i, j) to node (i+1, j+k); its fractional count combines forw[i][j] and back[i+1][j+k]; forw[s] = back[t] = 1.0, forw[t] = back[s] = p(x))

SLIDE 19

EM: fast version (DP)

  • initialize the conditional prob. table to uniform
  • repeat until converged:
      • E-step:
          • for each training example x (here: an (e...e, j...j) pair):
              • forward from s to t; note: forw[t] = p(x) = sum_z p(x, z)
              • backward from t to s; note: back[t] = 1; back[s] = forw[t]
              • for each edge (u, v) in the DP graph with label(u, v) = zi:
                  • fraccount(zi) += forw[u] * back[v] * prob(u, v) / p(x)
      • M-step: count-n-divide on fraccounts => new model

(figure: a path from s through edge (u, v) to t; forw[u] * prob(u, v) * back[v] = sum of p(x, z) over all z that use edge (u, v); forw[t] = back[s] = p(x) = sum_z p(x, z))
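The fast E-step can be sketched end to end on the W AY N lattice. The table and all variable names here are illustrative assumptions; backward mirrors forward, and each edge contributes forw[u] * prob(u, v) * back[v] / p(x) without ever enumerating alignments:

```python
from collections import defaultdict
from fractions import Fraction as F

eprons = ["W", "AY", "N"]
jprons = ["W", "A", "I", "N"]

# assumed channel model p(jseg | epron): the slides' first-iteration table
table = {
    "W":  {("W",): F(2, 3), ("W", "A"): F(1, 3)},
    "AY": {("A",): F(1, 3), ("A", "I"): F(1, 3), ("I",): F(1, 3)},
    "N":  {("N",): F(2, 3), ("I", "N"): F(1, 3)},
}
n, m = len(eprons), len(jprons)

forw = defaultdict(lambda: defaultdict(F))
back = defaultdict(lambda: defaultdict(F))
forw[0][0] = F(1)            # forw[s] = 1
back[n][m] = F(1)            # back[t] = 1

for i in range(n):           # forward, from s to t
    for j in list(forw[i]):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]]:
                forw[i+1][j+k] += forw[i][j] * table[eprons[i]][jseg]

for i in reversed(range(n)): # backward, from t to s
    for j in range(m + 1):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]] and (j + k) in back[i+1]:
                back[i][j] += table[eprons[i]][jseg] * back[i+1][j+k]

px = forw[n][m]              # == back[0][0] == p(x)

fraccount = defaultdict(lambda: defaultdict(F))
for i in range(n):           # E-step: one pass over edges
    for j in list(forw[i]):
        for k in range(1, min(m - j, 3) + 1):
            jseg = tuple(jprons[j:j+k])
            if jseg in table[eprons[i]]:
                fraccount[eprons[i]][jseg] += (
                    forw[i][j] * table[eprons[i]][jseg] * back[i+1][j+k] / px)

print(px, fraccount["AY"][("A", "I")])    # 8/27 1/2
```

The resulting count for AY -> A I is 1/2, matching the slow enumeration version.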

SLIDE 20

EM


SLIDE 21

Why does EM increase p(data) iteratively?


SLIDE 22

Why does EM increase p(data) iteratively?

(figure: log p(data) is lower-bounded by a convex auxiliary function; the gap is a KL-divergence; maximizing the auxiliary at each iteration converges to a local maximum)
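The picture corresponds to the standard lower-bound argument; a sketch in common notation (q is any distribution over the hidden z):

```latex
\log p(x;\theta)
  \;=\; \underbrace{\sum_z q(z)\,\log\frac{p(x,z;\theta)}{q(z)}}_{\text{auxiliary function}}
  \;+\; \underbrace{\mathrm{KL}\bigl(q(z)\,\big\|\,p(z\mid x;\theta)\bigr)}_{\ge 0}
```

Setting q(z) = p(z | x; theta_t) in the E-step makes the KL term zero at theta_t, so the auxiliary touches log p(x; theta) there; maximizing the auxiliary over theta in the M-step therefore cannot decrease log p(x; theta).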

SLIDE 23

How to maximize the auxiliary?

posteriors over the three alignments of W AY N to W A I N:
  z   (W -> W,   AY -> A,   N -> I N):  p(z | x)   = 0.5
  z'  (W -> W,   AY -> A I, N -> N):    p(z' | x)  = 0.3
  z'' (W -> W A, AY -> I,   N -> N):    p(z'' | x) = 0.2

just count-n-divide on the fractional data!

(as if MLE on complete data with 5x z, 3x z', and 2x z'')
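Count-n-divide on fractional data is just weighted MLE. A small check; the encoding of the three alignments is mine, with the posteriors 0.5, 0.3, 0.2 written as exact fractions:

```python
from collections import defaultdict
from fractions import Fraction as F

# posterior-weighted alignments: p(z|x)=0.5, p(z'|x)=0.3, p(z''|x)=0.2
weighted = [
    (F(5, 10), [("W", "W"),   ("AY", "A"),   ("N", "I N")]),   # z
    (F(3, 10), [("W", "W"),   ("AY", "A I"), ("N", "N")]),     # z'
    (F(2, 10), [("W", "W A"), ("AY", "I"),   ("N", "N")]),     # z''
]

counts = defaultdict(lambda: defaultdict(F))
for w, z in weighted:
    for e, j in z:
        counts[e][j] += w                      # count: accumulate fractional counts

# divide: normalize each left-hand side into a conditional prob table
model = {e: {j: c / sum(row.values()) for j, c in row.items()}
         for e, row in counts.items()}
print(model["AY"]["A I"])   # 3/10
```

This is exactly MLE as if the corpus contained 5 copies of z, 3 of z', and 2 of z''.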