Lecture 9: Hidden Markov Model (Kai-Wei Chang, CS @ University of Virginia)


SLIDE 1

Lecture 9: Hidden Markov Model

Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

1 CS6501 Natural Language Processing

SLIDE 2

This lecture

- Hidden Markov Model
- Different views of HMM
- HMM in supervised learning setting

2 CS6501 Natural Language Processing

SLIDE 3

CS6501 Natural Language Processing 3

Recap: Parts of Speech

- Traditional parts of speech
  - ~8 of them

SLIDE 4

CS6501 Natural Language Processing 4

Recap: Tagset

- Penn TreeBank tagset, 45 tags:
  - e.g., PRP$, WRB, WP$, VBG
  - Penn POS annotations:

    The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

- Universal tagset, 12 tags:
  - NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ".", X

SLIDE 5

CS6501 Natural Language Processing 5

Recap: POS Tagging vs. Word Clustering

- Words often have more than one POS, e.g., "back":
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- Syntax vs. semantics (details later)

(These examples are from Dekang Lin.)

SLIDE 6

Recap: POS tag sequences

- Some tag sequences are more likely to occur than others
- POS n-gram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_

CS6501 Natural Language Processing 6

Existing methods often model POS tagging as a sequence tagging problem

SLIDE 7

Evaluation

- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on the Penn Treebank:
  - State of the art: ~97%
  - Trivial baseline (most likely tag): ~94%
  - Human performance: ~97%

CS6501 Natural Language Processing 7
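
The metric described above is per-token accuracy. Here is a minimal sketch (not from the slides) of how it can be computed, assuming predictions and gold annotations are given as lists of tag sequences:

```python
# Per-token tagging accuracy: fraction of words whose predicted tag matches the gold tag.
def tagging_accuracy(predicted, gold):
    """predicted, gold: lists of tag sequences, one sequence per sentence."""
    correct = total = 0
    for p_tags, g_tags in zip(predicted, gold):
        correct += sum(p == g for p, g in zip(p_tags, g_tags))
        total += len(g_tags)
    return correct / total

print(tagging_accuracy([["DT", "NN"]], [["DT", "VB"]]))   # 0.5
```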

SLIDE 8

Building a POS tagger

- Supervised learning
  - Assume linguists have annotated several examples

CS6501 Natural Language Processing 8

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Tag set: DT, JJ, NN, VBD, ...  →  POS Tagger
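
As a minimal sketch (not from the slides), the annotated training data above can be read into (word, tag) pairs; the helper name below is hypothetical:

```python
# Read Penn-style "word/TAG" tokens into (word, tag) pairs, splitting on the last "/".
def read_tagged_sentence(line):
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(read_tagged_sentence("The/DT grand/JJ jury/NN commented/VBD ./."))
# [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('.', '.')]
```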

SLIDE 9

POS induction

- Unsupervised learning
  - Assume we only have an unannotated corpus

CS6501 Natural Language Processing 9

The grand jury commented on a number of other topics .

Tag set: DT, JJ, NN, VBD, ...  →  POS Tagger

SLIDE 10

TODAY: Hidden Markov Model

- We focus on the supervised learning setting
- What is the most likely sequence of tags for the given sequence of words w?
- We will talk about other ML models for this type of prediction task later.

CS6501 Natural Language Processing 10

SLIDE 11

Let’s try

a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
a/DT ha77y/JJ b09d/NN

What is the POS tag sequence of the following sentence?
a ha77y cat was s05g05g .

CS6501 Natural Language Processing 11

Don’t worry! There is no problem with your eyes or computer.

SLIDE 12

Let’s try

- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.  →  a/DT dog/NN is/VBZ chasing/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.  →  a/DT fox/NN is/VBZ running/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.  →  a/DT boy/NN is/VBZ singing/VBG ./.
- a/DT ha77y/JJ b09d/NN  →  a/DT happy/JJ bird/NN
- a ha77y cat was s05g05g .  →  a happy cat was singing .

CS6501 Natural Language Processing 12

SLIDE 13

How do you predict the tags?

- Two types of information are useful:
  - Relations between words and tags
  - Relations between tags and tags
    - e.g., DT NN, DT JJ NN, ...

CS6501 Natural Language Processing 13

SLIDE 14

Statistical POS tagging

- What is the most likely sequence of tags for the given sequence of words w?

CS6501 Natural Language Processing 14

P(DT JJ NN | a smart dog)
  = P(DT JJ NN, a smart dog) / P(a smart dog)
  ∝ P(DT JJ NN, a smart dog)
  = P(DT JJ NN) P(a smart dog | DT JJ NN)

SLIDE 15

Transition Probability

- Joint probability: P(t, w) = P(t) P(w | t)
- P(t) = P(t_1, t_2, ..., t_n)
       = P(t_1) P(t_2 | t_1) P(t_3 | t_2, t_1) ... P(t_n | t_1, ..., t_{n-1})
       ≈ P(t_1) P(t_2 | t_1) P(t_3 | t_2) ... P(t_n | t_{n-1})
       = ∏_{i=1}^{n} P(t_i | t_{i-1})

- A bigram model over POS tags! (Similarly, we can define an n-gram model over POS tags; this is usually called a higher-order HMM.)

CS6501 Natural Language Processing 15

Markov assumption

SLIDE 16

Emission Probability

- Joint probability: P(t, w) = P(t) P(w | t)
- Assume words depend only on their POS tags
- P(w | t) ≈ P(w_1 | t_1) P(w_2 | t_2) ... P(w_n | t_n) = ∏_{i=1}^{n} P(w_i | t_i)

i.e., P(a smart dog | DT JJ NN)
    = P(a | DT) P(smart | JJ) P(dog | NN)

CS6501 Natural Language Processing 16

Independence assumption

SLIDE 17

Put them together

- Joint probability: P(t, w) = P(t) P(w | t)
- P(t, w) = P(t_1) P(t_2 | t_1) P(t_3 | t_2) ... P(t_n | t_{n-1})
            × P(w_1 | t_1) P(w_2 | t_2) ... P(w_n | t_n)
          = ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

e.g., P(a smart dog, DT JJ NN)
    = P(a | DT) P(smart | JJ) P(dog | NN)
      × P(DT | start) P(JJ | DT) P(NN | JJ)

CS6501 Natural Language Processing 17
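
To make the factorization concrete, here is a minimal sketch in Python. The slide gives no numeric values, so the toy probability tables and the "<s>" start symbol below are illustrative assumptions only:

```python
# Joint probability P(w, t) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}),
# with made-up toy numbers (real values would be estimated from a treebank).
transition = {("<s>", "DT"): 0.6, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.7}      # P(t_i | t_{i-1})
emission   = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}  # P(w_i | t_i)

def joint_prob(words, tags, transition, emission):
    prob = 1.0
    prev = "<s>"                      # special start symbol, so the first factor is P(t_1 | start)
    for w, t in zip(words, tags):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

print(joint_prob(["a", "smart", "dog"], ["DT", "JJ", "NN"], transition, emission))
# 0.6 * 0.4 * 0.3 * 0.01 * 0.7 * 0.02 ≈ 1.008e-05
```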

SLIDE 18

Put them together

- Two independence assumptions:
  - Approximate P(t) by a bigram (or n-gram) model
  - Assume each word depends only on its POS tag

CS6501 Natural Language Processing 18

initial probability q(t_1)

SLIDE 19

HMMs as probabilistic FSA

CS6501 Natural Language Processing 19

Julia Hockenmaier: Intro to NLP

SLIDE 20

Table representation

CS6501 Natural Language Processing 20

Let λ = {A, B, π} represent all parameters

SLIDE 21

Hidden Markov Models (formal)

- States T = t1, t2, ..., tN
- Observations W = w1, w2, ..., wN
  - Each observation is a symbol drawn from a vocabulary V = {v1, v2, ..., vV}
- Transition probabilities
  - Transition probability matrix A = {a_jk}: a_jk = P(t_i = k | t_{i-1} = j), 1 ≤ j, k ≤ N
- Observation likelihoods
  - Output probability matrix B = {b_j(l)}: b_j(l) = P(w_i = v_l | t_i = j)
- Special initial probability vector π: π_j = P(t_1 = j), 1 ≤ j ≤ N

CS6501 Natural Language Processing
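
The original slide's table representation is not preserved in this transcript. As a rough sketch, λ = (A, B, π) can be stored as row-stochastic arrays; the tag set, vocabulary, and uniform values below are illustrative assumptions, not the deck's numbers:

```python
import numpy as np

tags  = ["DT", "JJ", "NN"]     # hidden states (N of them)
vocab = ["a", "smart", "dog"]  # observation symbols (V of them)

N, V = len(tags), len(vocab)
A  = np.full((N, N), 1.0 / N)  # A[j, k] = P(t_i = k | t_{i-1} = j)
B  = np.full((N, V), 1.0 / V)  # B[j, l] = P(w_i = v_l | t_i = j)
pi = np.full(N, 1.0 / N)       # pi[j]   = P(t_1 = j)

# Each conditional distribution must sum to one over its outcomes.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```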

SLIDE 22

How to build a second-order HMM?

- Second-order HMM
  - Trigram model over POS tags
  - P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})
  - P(w, t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)

CS6501 Natural Language Processing 22
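
A rough sketch (not from the slides) of how a second-order HMM scores a tagged sentence; padding the history with two "<s>" start symbols and the lookup-table format are assumptions made for illustration:

```python
# Score a tag sequence with trigram transitions P(t_i | t_{i-2}, t_{i-1}) and emissions P(w_i | t_i).
def second_order_score(words, tags, trans3, emit):
    prob = 1.0
    hist = ("<s>", "<s>")                              # (t_{i-2}, t_{i-1})
    for w, t in zip(words, tags):
        prob *= trans3.get((hist[0], hist[1], t), 0.0) * emit.get((t, w), 0.0)
        hist = (hist[1], t)                            # slide the history window
    return prob
```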

SLIDE 23

Probabilistic FSA for second-order HMM

CS6501 Natural Language Processing 23

Julia Hockenmaier: Intro to NLP

SLIDE 24

Prediction in generative model

- Inference: What is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?

CS6501 Natural Language Processing 24

initial probability q(t_1)

SLIDE 25

CS6501 Natural Language Processing 25

Example: The Verb “race”

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

SLIDE 26


Disambiguating “race”

CS6501 Natural Language Processing

SLIDE 27


Disambiguating “race”

- P(NN|TO) = .00047
- P(VB|TO) = .83
- P(race|NN) = .00057
- P(race|VB) = .00012
- P(NR|VB) = .0027
- P(NR|NN) = .0012
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
- So we (correctly) choose the verb reading.

CS6501 Natural Language Processing
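
The two products can be checked directly with the probabilities given on the slide:

```python
# Comparing the VB and NN readings of "race" in "expected to race tomorrow".
p_vb = 0.83    * 0.0027 * 0.00012   # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_vb)   # ~2.7e-07
print(p_nn)   # ~3.2e-10
# p_vb >> p_nn, so the tagger picks the verb reading for "race".
```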

SLIDE 28


Jason and his Ice Creams

- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2007
- But you find Jason Eisner's diary
- Which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was

http://videolectures.net/hltss2010_eisner_plm/ http://www.cs.jhu.edu/~jason/papers/eisner.hmm.xls

CS6501 Natural Language Processing

SLIDE 29

(C)old day v.s. (H)ot day

CS6501 Natural Language Processing 29

Emission and transition probabilities:

               p(...|C)   p(...|H)   p(...|START)
  p(1|...)       0.7        0.1
  p(2|...)       0.2        0.2
  p(3|...)       0.1        0.7
  p(C|...)       0.8        0.1        0.5
  p(H|...)       0.1        0.8        0.5
  p(STOP|...)    0.1        0.1

[Chart: ice creams eaten (#cones) per diary day, plotted with p(H); titled "Weather States that Best Explain Ice Cream Consumption".]
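
With the table above, a brute-force enumeration over weather sequences already answers both questions for a short diary. The three-day diary below is made up, and the STOP probability is dropped for simplicity:

```python
from itertools import product

# Enumerate every hot/cold sequence for a short diary and compute its joint probability
# under the ice-cream HMM, using the transition and emission numbers from the table above.
trans = {("START", "C"): 0.5, ("START", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8}
emit  = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
         ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

diary = [3, 3, 1]                              # ice creams eaten on three days

best, total = None, 0.0
for weather in product("CH", repeat=len(diary)):
    p, prev = 1.0, "START"
    for state, cones in zip(weather, diary):
        p *= trans[(prev, state)] * emit[(state, cones)]
        prev = state
    total += p
    if best is None or p > best[1]:
        best = (weather, p)

print(best)    # (('H', 'H', 'H'), ≈0.01568): the most likely weather sequence
print(total)   # P(3, 3, 1): sum over all 2^3 weather sequences

# Enumeration costs 2^n sequences; the forward and Viterbi algorithms on the next
# slides compute the same two quantities with dynamic programming in O(n * N^2) time.
```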

SLIDE 30

Three basic problems for HMMs

- Likelihood of the input:
  - Compute P(w | λ) for the input w and HMM λ
- Decoding (tagging) the input:
  - Find the best tag sequence argmax_t P(t | w, λ)
- Estimation (learning):
  - Find the best model parameters
  - Case 1: supervised, tags are annotated
  - Case 2: unsupervised, only unannotated text

CS6501 Natural Language Processing 30

(How likely is it that the sentence "I love cat" occurs? Which POS tag sequence for "I love cat" is most likely? How do we learn the model?)

SLIDE 31

Three basic problems for HMMs

- Likelihood of the input:
  - Forward algorithm
- Decoding (tagging) the input:
  - Viterbi algorithm
- Estimation (learning):
  - Find the best model parameters
  - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text): forward-backward algorithm

CS6501 Natural Language Processing 31



SLIDE 33

Learning from Labeled Data

- Let's play a game!
- We count how often we see t_{i-1} t_i and w_i t_i, and then normalize.

CS6501 Natural Language Processing 33
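
A minimal sketch of that counting recipe (supervised maximum likelihood estimation). The two-sentence training set and the "<s>" start symbol are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Count tag bigrams t_{i-1} t_i and word/tag pairs w_i t_i, then normalize each count table.
def train_hmm(tagged_sentences):
    trans_counts = defaultdict(Counter)   # trans_counts[t_prev][t]
    emit_counts  = defaultdict(Counter)   # emit_counts[t][w]
    for sentence in tagged_sentences:     # each sentence: list of (word, tag) pairs
        prev = "<s>"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word]  += 1
            prev = tag
    transition = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
                  for p, cs in trans_counts.items()}
    emission   = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
                  for t, cs in emit_counts.items()}
    return transition, emission

data = [[("a", "DT"), ("smart", "JJ"), ("dog", "NN")],
        [("a", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
transition, emission = train_hmm(data)
print(transition["DT"])   # {'JJ': 0.5, 'NN': 0.5}
print(emission["NN"])     # {'dog': 1.0}
```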

SLIDE 34

Three basic problems for HMMs

- Likelihood of the input:
  - Forward algorithm
- Decoding (tagging) the input:
  - Viterbi algorithm
- Estimation (learning):
  - Find the best model parameters
  - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text): forward-backward algorithm

CS6501 Natural Language Processing 34


We need dynamic programming for the other problems
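
As a preview of that dynamic programming (the deck covers it in a later lecture), here is a minimal Viterbi sketch. The dictionary-based parameter format and the toy ice-cream numbers are assumptions for illustration, not the course's reference implementation:

```python
# Viterbi decoding: find argmax_t P(t, w) under an HMM with initial probs pi,
# transition probs A[prev][t] = P(t_i = t | t_{i-1} = prev), emissions B[t][w] = P(w | t).
def viterbi(words, tags, pi, A, B):
    # best[i][t]: probability of the best tag sequence for words[:i+1] that ends in tag t
    best = [{t: pi[t] * B[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * A[p].get(t, 0.0))
            best[i][t] = best[i - 1][prev] * A[prev].get(t, 0.0) * B[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow back-pointers from the best final state.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy check with the ice-cream numbers from the earlier slide.
tags = ["C", "H"]
pi = {"C": 0.5, "H": 0.5}
A  = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}
B  = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}
print(viterbi([3, 3, 1], tags, pi, A, B))   # ['H', 'H', 'H']
```

Each cell of the table is filled in once, so the whole computation is O(n * N^2) instead of enumerating all N^n tag sequences.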