Lecture 8: Word Clustering (Kai-Wei Chang, CS @ University of Virginia) - PowerPoint Presentation


SLIDE 1

Lecture 8: Word Clustering

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


SLIDE 2

This lecture

- Brown Clustering


SLIDE 3

Brown Clustering

- Similar to a language model, but the basic unit is "word clusters"

- Intuition: again, similar words appear in similar contexts

- Recap: bigram language models

  $P(x_0, x_1, x_2, \ldots, x_n) = P(x_1 \mid x_0)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1}) = \prod_{i=1}^{n} P(x_i \mid x_{i-1})$


π‘₯# is a dummy word representing ”begin of a sentence”

SLIDE 4

Motivation example

- "a dog is chasing a cat"

  $P(x_0, \text{"a"}, \text{"dog"}, \ldots, \text{"cat"}) = P(\text{"a"} \mid x_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$

- Assume every word belongs to a cluster


[Figure: Cluster 3 = {a, the}; Cluster 46 = {dog, cat, fox, rabbit, bird, boy}; Cluster 64 = {is, was}; Cluster 8 = {chasing, following, biting, …}]

SLIDE 5

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"


[Figure: clusters as above; "a dog is chasing a cat" maps to the cluster sequence C3 C46 C64 C8 C3 C46]

SLIDE 6

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46]

SLIDE 7

Motivation example

- Assume every word belongs to a cluster
- "the boy is following a rabbit"


[Figure: clusters as above; the/C3 boy/C46 is/C64 following/C8 a/C3 rabbit/C46]

SLIDE 8

Motivation example

- Assume every word belongs to a cluster
- "a fox was chasing a bird"


[Figure: clusters as above; a/C3 fox/C46 was/C64 chasing/C8 a/C3 bird/C46]

SLIDE 9

Brown Clustering

v Let 𝐷 π‘₯ denote the cluster that π‘₯ belongs to vβ€œa dog is chasing a cat”


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46, with edges labeled P(C(dog) | C(a)) and P(cat | C(cat))]

SLIDE 10

Brown clustering model

- P("a dog is chasing a cat")

  $= P(C(\text{"a"}) \mid C(x_0))\, P(C(\text{"dog"}) \mid C(\text{"a"}))\, P(C(\text{"is"}) \mid C(\text{"dog"})) \cdots \times P(\text{"a"} \mid C(\text{"a"}))\, P(\text{"dog"} \mid C(\text{"dog"})) \cdots$


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46, with edges labeled P(C(dog) | C(a)) and P(cat | C(cat))]

SLIDE 11

Brown clustering model

- P("a dog is chasing a cat")

  $= P(C(\text{"a"}) \mid C(x_0))\, P(C(\text{"dog"}) \mid C(\text{"a"}))\, P(C(\text{"is"}) \mid C(\text{"dog"})) \cdots \times P(\text{"a"} \mid C(\text{"a"}))\, P(\text{"dog"} \mid C(\text{"dog"})) \cdots$

- In general

  $P(x_0, x_1, x_2, \ldots, x_n) = P(C(x_1) \mid C(x_0))\, P(C(x_2) \mid C(x_1)) \cdots P(C(x_n) \mid C(x_{n-1}))\; P(x_1 \mid C(x_1))\, P(x_2 \mid C(x_2)) \cdots P(x_n \mid C(x_n)) = \prod_{i=1}^{n} P(C(x_i) \mid C(x_{i-1}))\, P(x_i \mid C(x_i))$
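As a concrete illustration (not from the slides), the following sketch scores a sentence under this factorization using the toy clusters from the figure; the transition and emission tables are invented placeholder values, not trained parameters.

```python
# Toy word -> cluster map from the figure; cluster 0 stands for the
# begin-of-sentence position C(x_0).
cluster = {"<s>": 0, "a": 3, "the": 3,
           "dog": 46, "cat": 46, "fox": 46, "rabbit": 46, "bird": 46, "boy": 46,
           "is": 64, "was": 64,
           "chasing": 8, "following": 8, "biting": 8}

# Invented placeholder parameters; a trained model would estimate these.
p_trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 64): 0.7, (64, 8): 0.9, (8, 3): 0.8}
p_emit = {"a": 0.6, "the": 0.4,
          "dog": 0.2, "cat": 0.2, "fox": 0.15, "rabbit": 0.15,
          "bird": 0.15, "boy": 0.15, "is": 0.6, "was": 0.4,
          "chasing": 0.4, "following": 0.3, "biting": 0.3}

def brown_prob(sentence):
    """prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= p_trans[(cluster[prev], cluster[cur])] * p_emit[cur]
    return p

print(brown_prob(["<s>", "a", "dog", "is", "chasing", "a", "cat"]))
```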


SLIDE 12

Model parameters

𝑄 π‘₯#, π‘₯%, π‘₯&, … , π‘₯( = Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 )

[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46]

Parameter set 1: $P(C(x_i) \mid C(x_{i-1}))$. Parameter set 2: $P(x_i \mid C(x_i))$. Parameter set 3: the clustering $C(x_i)$.

SLIDE 13

Model parameters

𝑄 π‘₯#, π‘₯%, π‘₯&, … , π‘₯( = Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) v A vocabulary set 𝑋 v A function 𝐷: 𝑋 β†’ {1, 2, 3, … 𝑙 }

v A partition of vocabulary into k classes

v Conditional probability 𝑄(𝑑′ ∣ 𝑑) for 𝑑, 𝑑J ∈ 1,… , 𝑙 v Conditional probability 𝑄(π‘₯ ∣ 𝑑) for 𝑑, 𝑑J ∈ 1,… , 𝑙 ,π‘₯ ∈ 𝑑


πœ„ represents the set of conditional probability parameters C represents the clustering

SLIDE 14

Log likelihood

LL(πœ„, 𝐷 ) = log 𝑄 π‘₯#, π‘₯%, π‘₯&,… , π‘₯( πœ„, 𝐷 = log Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) = βˆ‘-.%

/

[log P 𝐷 w3 𝐷 π‘₯3+% + log𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) ]

- Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$:

  1. $\max_{\theta \in \Theta} LL(\theta, C)$
  2. $\max_{C} LL(\theta, C)$


SLIDE 15

$\max_{\theta \in \Theta} LL(\theta, C)$

$LL(\theta, C) = \log P(x_0, x_1, x_2, \ldots, x_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(x_i) \mid C(x_{i-1}))\, P(x_i \mid C(x_i)) = \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right]$

- $P(c' \mid c) = \dfrac{\#(c, c')}{\#(c)}$

- $P(x \mid c) = \dfrac{\#(x, c)}{\#(c)}$
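A minimal sketch (my naming, not the slides') of this step: given tokenized sentences and a fixed word-to-cluster map, compute the count ratios above.

```python
from collections import Counter

def estimate(corpus, cluster):
    """MLE step for a fixed clustering:
    P(c' | c) = #(c, c') / #(c) and P(x | c) = #(x, c) / #(c)."""
    trans = Counter()   # #(c, c'): adjacent cluster pairs
    emit = Counter()    # #(x, c): word x occurring in cluster c
    ccount = Counter()  # #(c): occurrences of cluster c
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            trans[(cluster[prev], cluster[cur])] += 1
        for w in sent:
            emit[(w, cluster[w])] += 1
            ccount[cluster[w]] += 1
    p_trans = {(c, c2): n / ccount[c] for (c, c2), n in trans.items()}
    p_emit = {(x, c): n / ccount[c] for (x, c), n in emit.items()}
    return p_trans, p_emit
```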


SLIDE 16

$\max_{C} LL(\theta, C)$

$\max_{C} \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$

where $G$ is a constant

- Here,

  $q(c, c') = \dfrac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad q(c) = \dfrac{\#(c)}{\sum_{c} \#(c)}$

- $\dfrac{q(c, c')}{q(c)\, q(c')} = \dfrac{q(c \mid c')}{q(c)}$ (mutual information)
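For concreteness, a sketch (mine, not from the slides) of the quantity being maximized: the mutual information between the clusters of adjacent words, computed from cluster bigram counts. Here the marginals are taken over the left and right positions of the adjacent-pair distribution.

```python
import math
from collections import Counter

def mutual_information(corpus, cluster):
    """sum_{c, c'} q(c, c') * log( q(c, c') / (q(c) q(c')) )
    over adjacent word pairs, given a word -> cluster map that
    covers every word in the corpus."""
    pair, left, right = Counter(), Counter(), Counter()
    total = 0
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            c, c2 = cluster[prev], cluster[cur]
            pair[(c, c2)] += 1
            left[c] += 1
            right[c2] += 1
            total += 1
    mi = 0.0
    for (c, c2), n in pair.items():
        # q(c,c') / (q(c) q(c')) simplifies to n * total / (left * right)
        mi += (n / total) * math.log(n * total / (left[c] * right[c2]))
    return mi
```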


SLIDE 17

$\max_{C} LL(\theta, C)$

$\max_{C} \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$


SLIDE 18

Algorithm 1

- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| - k merge steps:

  - Pick two clusters and merge them
  - At each step, pick the merge that maximizes $LL(\theta, C)$

- Cost? O(|V| - k) iterations Γ— O(|V|^2) candidate pairs Γ— O(|V|^2) to compute LL = O(|V|^5); this can be improved to O(|V|^3) (see the sketch below)
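Below is a deliberately naive sketch of Algorithm 1, illustrating the O(|V|^5) version rather than the optimized O(|V|^3) one; it reuses the `mutual_information` helper from the previous sketch and re-scores every candidate merge from scratch.

```python
import itertools

def greedy_merge(clusters, corpus, k):
    """Naive Algorithm 1: repeatedly merge the pair of clusters whose
    merge yields the highest mutual information, until k clusters remain.
    `clusters` maps word -> cluster id (initially one word per cluster)
    and must cover every word in `corpus`."""
    while len(set(clusters.values())) > k:
        best_pair, best_mi = None, float("-inf")
        ids = sorted(set(clusters.values()))
        for a, b in itertools.combinations(ids, 2):
            trial = {w: (a if c == b else c) for w, c in clusters.items()}
            mi = mutual_information(corpus, trial)  # helper defined above
            if mi > best_mi:
                best_pair, best_mi = (a, b), mi
        a, b = best_pair
        clusters = {w: (a if c == b else c) for w, c in clusters.items()}
    return clusters
```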


SLIDE 19

Algorithm 2

- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \ldots, c_m$
- For $i = m + 1, \ldots, |V|$:

  - Create a new cluster $c_{m+1}$ for the i-th most frequent word (we now have m + 1 clusters)
  - Choose two clusters from the m + 1 clusters based on $LL(\theta, C)$ and merge them β‡’ back to m clusters

- Carry out (m - 1) final merges β‡’ full hierarchy
- Running time: O(|V| m^2 + n), where n = #words in the corpus (see the sketch below)
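A rough sketch of Algorithm 2's control flow (again mine, and far from the efficient bookkeeping a real implementation uses); it reuses `greedy_merge` from the previous sketch.

```python
def brown_algorithm2(words_by_freq, corpus, m):
    """Sketch of Algorithm 2: maintain a working set of m clusters.
    `words_by_freq` is the vocabulary sorted by decreasing frequency."""
    clusters = {w: i for i, w in enumerate(words_by_freq[:m])}
    for w in words_by_freq[m:]:
        clusters[w] = max(clusters.values()) + 1  # the (m+1)-th cluster
        # Approximation for the sketch: score only over words clustered
        # so far; a real implementation maintains exact pair statistics.
        seen = [[t for t in sent if t in clusters] for sent in corpus]
        clusters = greedy_merge(clusters, seen, m)  # one merge: back to m
    # (m - 1) further merges over the final m clusters give the hierarchy.
    return clusters
```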


SLIDE 20

Example clusters (Brown et al., 1992)


SLIDE 21

Example hierarchy (Miller et al., 2004)


SLIDE 22

Quiz 1

- 30 min (Tue. 9/20, 12:30pm-1:00pm)

- Fill-in-the-blank, true/false
- Short answer

- Closed book, closed notes, closed laptop
- Sample questions:

  - Add-one smoothing vs. add-lambda smoothing

  - $a = (1, 3, 5)$, $b = (2, 3, 6)$: what is the cosine similarity between $a$ and $b$?
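For reference, that sample question works out as:

$\cos(a, b) = \dfrac{a \cdot b}{\|a\|\,\|b\|} = \dfrac{1 \cdot 2 + 3 \cdot 3 + 5 \cdot 6}{\sqrt{1 + 9 + 25}\,\sqrt{4 + 9 + 36}} = \dfrac{41}{7\sqrt{35}} \approx 0.99$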

