Lecture 8: Word Clustering (Kai-Wei Chang, CS @ University of Virginia) - PowerPoint Presentation


SLIDE 1

Lecture 8: Word Clustering

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


SLIDE 2

This lecture

- Brown Clustering


SLIDE 3

Brown Clustering

- Similar to a language model, but the basic unit is "word clusters"

- Intuition: again, similar words appear in similar contexts

- Recap: bigram language models

  $P(x_0, x_1, x_2, \ldots, x_n) = P(x_1 \mid x_0)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_{n-1}) = \prod_{i=1}^{n} P(x_i \mid x_{i-1})$


π‘₯# is a dummy word representing ”begin of a sentence”

SLIDE 4

Motivation example

- "a dog is chasing a cat"

  $P(x_0, \text{"a"}, \text{"dog"}, \ldots, \text{"cat"}) = P(\text{"a"} \mid x_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$

- Assume every word belongs to a cluster


[Figure: Cluster 3 = {a, the}; Cluster 46 = {dog, cat, fox, rabbit, bird, boy}; Cluster 64 = {is, was}; Cluster 8 = {chasing, following, biting, …}]

SLIDE 5

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"


[Figure: clusters as above; "a dog is chasing a cat" maps to the cluster sequence C3 C46 C64 C8 C3 C46]

SLIDE 6

Motivation example

- Assume every word belongs to a cluster
- "a dog is chasing a cat"


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46]

SLIDE 7

Motivation example

- Assume every word belongs to a cluster
- "the boy is following a rabbit"


[Figure: clusters as above; the/C3 boy/C46 is/C64 following/C8 a/C3 rabbit/C46]

SLIDE 8

Motivation example

- Assume every word belongs to a cluster
- "a fox was chasing a bird"


[Figure: clusters as above; a/C3 fox/C46 was/C64 chasing/C8 a/C3 bird/C46]

SLIDE 9

Brown Clustering

v Let 𝐷 π‘₯ denote the cluster that π‘₯ belongs to vβ€œa dog is chasing a cat”


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46, with edges labeled P(C(dog) | C(a)) and P(cat | C(cat))]

SLIDE 10

Brown clustering model

- P("a dog is chasing a cat")

  $= P(C(\text{"a"}) \mid C(x_0))\, P(C(\text{"dog"}) \mid C(\text{"a"}))\, P(C(\text{"is"}) \mid C(\text{"dog"})) \cdots \times P(\text{"a"} \mid C(\text{"a"}))\, P(\text{"dog"} \mid C(\text{"dog"})) \cdots$


[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46, with edges labeled P(C(dog) | C(a)) and P(cat | C(cat))]

SLIDE 11

Brown clustering model

- P("a dog is chasing a cat")

  $= P(C(\text{"a"}) \mid C(x_0))\, P(C(\text{"dog"}) \mid C(\text{"a"}))\, P(C(\text{"is"}) \mid C(\text{"dog"})) \cdots \times P(\text{"a"} \mid C(\text{"a"}))\, P(\text{"dog"} \mid C(\text{"dog"})) \cdots$

- In general

  $P(x_0, x_1, x_2, \ldots, x_n) = P(C(x_1) \mid C(x_0))\, P(C(x_2) \mid C(x_1)) \cdots P(C(x_n) \mid C(x_{n-1}))\; P(x_1 \mid C(x_1))\, P(x_2 \mid C(x_2)) \cdots P(x_n \mid C(x_n)) = \prod_{i=1}^{n} P(C(x_i) \mid C(x_{i-1}))\, P(x_i \mid C(x_i))$
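As a concrete illustration (not from the slides), the following sketch scores a sentence under this factorization using the toy clusters from the figure; the transition and emission tables are invented placeholder values, not trained parameters.

```python
# Toy word -> cluster map from the figure; cluster 0 stands for the
# begin-of-sentence position C(x_0).
cluster = {"<s>": 0, "a": 3, "the": 3,
           "dog": 46, "cat": 46, "fox": 46, "rabbit": 46, "bird": 46, "boy": 46,
           "is": 64, "was": 64,
           "chasing": 8, "following": 8, "biting": 8}

# Invented placeholder parameters; a trained model would estimate these.
p_trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 64): 0.7, (64, 8): 0.9, (8, 3): 0.8}
p_emit = {"a": 0.6, "the": 0.4,
          "dog": 0.2, "cat": 0.2, "fox": 0.15, "rabbit": 0.15,
          "bird": 0.15, "boy": 0.15, "is": 0.6, "was": 0.4,
          "chasing": 0.4, "following": 0.3, "biting": 0.3}

def brown_prob(sentence):
    """prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= p_trans[(cluster[prev], cluster[cur])] * p_emit[cur]
    return p

print(brown_prob(["<s>", "a", "dog", "is", "chasing", "a", "cat"]))
```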


SLIDE 12

Model parameters

𝑄 π‘₯#, π‘₯%, π‘₯&, … , π‘₯( = Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 )

[Figure: clusters as above; a/C3 dog/C46 is/C64 chasing/C8 a/C3 cat/C46]

Parameter set 1: $P(C(x_i) \mid C(x_{i-1}))$. Parameter set 2: $P(x_i \mid C(x_i))$. Parameter set 3: the clustering $C(x_i)$.

SLIDE 13

Model parameters

𝑄 π‘₯#, π‘₯%, π‘₯&, … , π‘₯( = Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) v A vocabulary set 𝑋 v A function 𝐷: 𝑋 β†’ {1, 2, 3, … 𝑙 }

v A partition of vocabulary into k classes

v Conditional probability 𝑄(𝑑′ ∣ 𝑑) for 𝑑, 𝑑J ∈ 1,… , 𝑙 v Conditional probability 𝑄(π‘₯ ∣ 𝑑) for 𝑑, 𝑑J ∈ 1,… , 𝑙 ,π‘₯ ∈ 𝑑


πœ„ represents the set of conditional probability parameters C represents the clustering

SLIDE 14

Log likelihood

LL(πœ„, 𝐷 ) = log 𝑄 π‘₯#, π‘₯%, π‘₯&,… , π‘₯( πœ„, 𝐷 = log Ξ -.%

/

P 𝐷 w3 𝐷 π‘₯3+% 𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) = βˆ‘-.%

/

[log P 𝐷 w3 𝐷 π‘₯3+% + log𝑄(π‘₯3 ∣ 𝐷 π‘₯3 ) ]

- Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$:

  1. $\max_{\theta \in \Theta} LL(\theta, C)$
  2. $\max_{C} LL(\theta, C)$


SLIDE 15

$\max_{\theta \in \Theta} LL(\theta, C)$

$LL(\theta, C) = \log P(x_0, x_1, x_2, \ldots, x_n \mid \theta, C) = \log \prod_{i=1}^{n} P(C(x_i) \mid C(x_{i-1}))\, P(x_i \mid C(x_i)) = \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right]$

- $P(c' \mid c) = \dfrac{\#(c, c')}{\#(c)}$

- $P(x \mid c) = \dfrac{\#(x, c)}{\#(c)}$
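A minimal sketch (my naming, not the slides') of this step: given tokenized sentences and a fixed word-to-cluster map, compute the count ratios above.

```python
from collections import Counter

def estimate(corpus, cluster):
    """MLE step for a fixed clustering:
    P(c' | c) = #(c, c') / #(c) and P(x | c) = #(x, c) / #(c)."""
    trans = Counter()   # #(c, c'): adjacent cluster pairs
    emit = Counter()    # #(x, c): word x occurring in cluster c
    ccount = Counter()  # #(c): occurrences of cluster c
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            trans[(cluster[prev], cluster[cur])] += 1
        for w in sent:
            emit[(w, cluster[w])] += 1
            ccount[cluster[w]] += 1
    p_trans = {(c, c2): n / ccount[c] for (c, c2), n in trans.items()}
    p_emit = {(x, c): n / ccount[c] for (x, c), n in emit.items()}
    return p_trans, p_emit
```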


SLIDE 16

$\max_{C} LL(\theta, C)$

$\max_{C} \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$

where $G$ is a constant

- Here,

  $q(c, c') = \dfrac{\#(c, c')}{\sum_{c, c'} \#(c, c')}, \qquad q(c) = \dfrac{\#(c)}{\sum_{c} \#(c)}$

- $\dfrac{q(c, c')}{q(c)\, q(c')} = \dfrac{q(c \mid c')}{q(c)}$ (mutual information)
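For concreteness, a sketch (mine, not from the slides) of the quantity being maximized: the mutual information between the clusters of adjacent words, computed from cluster bigram counts. Here the marginals are taken over the left and right positions of the adjacent-pair distribution.

```python
import math
from collections import Counter

def mutual_information(corpus, cluster):
    """sum_{c, c'} q(c, c') * log( q(c, c') / (q(c) q(c')) )
    over adjacent word pairs, given a word -> cluster map that
    covers every word in the corpus."""
    pair, left, right = Counter(), Counter(), Counter()
    total = 0
    for sent in corpus:
        for prev, cur in zip(sent, sent[1:]):
            c, c2 = cluster[prev], cluster[cur]
            pair[(c, c2)] += 1
            left[c] += 1
            right[c2] += 1
            total += 1
    mi = 0.0
    for (c, c2), n in pair.items():
        # q(c,c') / (q(c) q(c')) simplifies to n * total / (left * right)
        mi += (n / total) * math.log(n * total / (left[c] * right[c2]))
    return mi
```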


SLIDE 17

$\max_{C} LL(\theta, C)$

$\max_{C} \sum_{i=1}^{n} \left[ \log P(C(x_i) \mid C(x_{i-1})) + \log P(x_i \mid C(x_i)) \right] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$


SLIDE 18

Algorithm 1

- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| - k merge steps:

  - Pick two clusters and merge them
  - At each step, pick the merge that maximizes $LL(\theta, C)$

- Cost? O(|V| - k) iterations Γ— O(|V|^2) candidate pairs Γ— O(|V|^2) to compute LL = O(|V|^5); this can be improved to O(|V|^3) (see the sketch below)
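Below is a deliberately naive sketch of Algorithm 1, illustrating the O(|V|^5) version rather than the optimized O(|V|^3) one; it reuses the `mutual_information` helper from the previous sketch and re-scores every candidate merge from scratch.

```python
import itertools

def greedy_merge(clusters, corpus, k):
    """Naive Algorithm 1: repeatedly merge the pair of clusters whose
    merge yields the highest mutual information, until k clusters remain.
    `clusters` maps word -> cluster id (initially one word per cluster)
    and must cover every word in `corpus`."""
    while len(set(clusters.values())) > k:
        best_pair, best_mi = None, float("-inf")
        ids = sorted(set(clusters.values()))
        for a, b in itertools.combinations(ids, 2):
            trial = {w: (a if c == b else c) for w, c in clusters.items()}
            mi = mutual_information(corpus, trial)  # helper defined above
            if mi > best_mi:
                best_pair, best_mi = (a, b), mi
        a, b = best_pair
        clusters = {w: (a if c == b else c) for w, c in clusters.items()}
    return clusters
```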


SLIDE 19

Algorithm 2

- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \ldots, c_m$
- For $i = m + 1, \ldots, |V|$:

  - Create a new cluster $c_{m+1}$ for the i-th most frequent word (we now have m + 1 clusters)
  - Choose two clusters from the m + 1 clusters based on $LL(\theta, C)$ and merge them β‡’ back to m clusters

- Carry out (m - 1) final merges β‡’ full hierarchy
- Running time: O(|V| m^2 + n), where n = #words in the corpus (see the sketch below)
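A rough sketch of Algorithm 2's control flow (again mine, and far from the efficient bookkeeping a real implementation uses); it reuses `greedy_merge` from the previous sketch.

```python
def brown_algorithm2(words_by_freq, corpus, m):
    """Sketch of Algorithm 2: maintain a working set of m clusters.
    `words_by_freq` is the vocabulary sorted by decreasing frequency."""
    clusters = {w: i for i, w in enumerate(words_by_freq[:m])}
    for w in words_by_freq[m:]:
        clusters[w] = max(clusters.values()) + 1  # the (m+1)-th cluster
        # Approximation for the sketch: score only over words clustered
        # so far; a real implementation maintains exact pair statistics.
        seen = [[t for t in sent if t in clusters] for sent in corpus]
        clusters = greedy_merge(clusters, seen, m)  # one merge: back to m
    # (m - 1) further merges over the final m clusters give the hierarchy.
    return clusters
```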


SLIDE 20

Example clusters (Brown et al., 1992)


SLIDE 21

Example hierarchy (Miller et al., 2004)


SLIDE 22

Quiz 1

- 30 min (Tue. 9/20, 12:30pm-1:00pm)

- Fill-in-the-blank, true/false
- Short answer

- Closed book, closed notes, closed laptop
- Sample questions:

  - Add-one smoothing vs. add-lambda smoothing

  - $a = (1, 3, 5)$, $b = (2, 3, 6)$: what is the cosine similarity between $a$ and $b$?
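For reference, that sample question works out as:

$\cos(a, b) = \dfrac{a \cdot b}{\|a\|\,\|b\|} = \dfrac{1 \cdot 2 + 3 \cdot 3 + 5 \cdot 6}{\sqrt{1 + 9 + 25}\,\sqrt{4 + 9 + 36}} = \dfrac{41}{7\sqrt{35}} \approx 0.99$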

