A gentle introduction to Maximum Entropy Models and their friends
Mark Johnson, Brown University, November 2007
Outline
◮ What problems can MaxEnt solve?
◮ What are Maximum Entropy models?
◮ Learning Maximum Entropy models from data
◮ ONSET: Violated each time a syllable begins without an onset
◮ PEAK: Violated each time a syllable doesn’t have a peak V
◮ NOCODA: Violated each time a syllable has a non-empty coda
◮ ⋆COMPLEX: Violated each time a syllable has a complex onset
◮ FAITHV: Violated each time a V is inserted or deleted
◮ FAITHC: Violated each time a C is inserted or deleted
(OT tableau omitted: candidate realizations of /Pilkhin/ are scored against the constraints above; only the ⋆COMPLEX column and two fatal-violation marks ⋆! survive in this transcript.)
◮ Example: FAITHC(/Pilkhin/, [Pik.hin]) = 1
◮ Ex: if f = (PEAK, ⋆COMPLEX, FAITHC, FAITHV, NOCODA), then f(/Pilkhin/, [Pik.hin]) = (0, 0, 1, 0, 2), so the score is w · f = w3 + 2w5 for a weight vector w
◮ Called “linear” because the score is a linear function of the feature values f(x) (a small numeric sketch follows)
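To make the dot product concrete, here is a minimal Python sketch; the weight values are invented for illustration and are not from the slides:

    # Constraint order assumed: f = (PEAK, *COMPLEX, FAITHC, FAITHV, NOCODA)
    f = [0, 0, 1, 0, 2]                  # f(/Pilkhin/, [Pik.hin])
    w = [-10.0, -5.0, -2.0, -3.0, -1.0]  # hypothetical weight per constraint
    score = sum(wj * fj for wj, fj in zip(w, f))  # w . f = w3 + 2*w5
    print(score)                         # -2.0 + 2*(-1.0) = -4.0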
◮ Ci is a (possibly infinite) set of candidates
◮ xi ∈ Ci is the correct realization from Ci
◮ Unsupervised problem: the underlying form is not given in D
◮ If such a w exists (one under which each xi scores at least as high as every other candidate in Ci), then D is linearly separable (a brute-force check is sketched below)
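Read concretely, separability can be tested by enumeration when the candidate sets are finite. A sketch, assuming each datum is a pair (f(xi), Ci) of the correct candidate’s feature vector and the feature vectors of all candidates; this data layout is reused in the later sketches:

    # Sketch: does w make every correct candidate score at least as high
    # as each of its competitors? (Use strict > to demand unique winners.)
    def separates(w, D):
        def score(fx):
            return sum(wj * fj for wj, fj in zip(w, fx))
        return all(score(fxi) >= score(fx)
                   for fxi, Ci in D
                   for fx in Ci)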
◮ too many f(x) vectors to memorize if f contained all …
◮ maybe the supervised learning problem is unrealistically easy, …
◮ Although SVMs try to maximize the probability that the …
◮ In linguistics, a feature, e.g., “voiced”, is a function from phones to {+, −}
◮ In statistics, what linguists call constraints (a function from objects to real values) are called “features”
◮ In linguistics, what the statisticians call “features” would be called constraints
◮ In statistics, a “constraint” is a property that the estimated model is required to satisfy
◮ In statistics, the set of objects we’re defining a probability distribution over is called the sample space
◮ Require the model’s expected feature values to match the feature values observed in the data D
◮ As the size of the data |D| → ∞, the feature distribution in D will converge to the true feature distribution,
◮ so the distribution of features in the model converges to the truth as well
◮ The entropy H(p) = −Σx p(x) log p(x) measures the amount of information in a distribution p
◮ Higher entropy ⇒ less information
◮ Choose the distribution that satisfies the feature constraints and has maximum entropy (illustrated below)
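A toy illustration of the entropy claim (Python; the distributions are invented for the example): the uniform distribution commits to the least and has the highest entropy, while a nearly deterministic one carries more information and has low entropy.

    import math

    # H(p) = -sum_x p(x) log p(x)
    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 1.386... = log 4, the maximum
    print(entropy([0.97, 0.01, 0.01, 0.01]))  # 0.168..., nearly deterministic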
◮ Long-distance and local dependencies in syntax
◮ Many markedness and faithfulness constraints interact to determine the surface form
◮ generally requires numerical optimization (the standard gradient is given below)
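For reference, the objective usually optimized here is the conditional log-likelihood of the data, whose gradient for log-linear models is standard (the notation matches the Zw(Ci) and Ew[fj|Ci] used later in these slides):

    \frac{\partial}{\partial w_j} \sum_{i=1}^{n} \log p_w(x_i \mid C_i)
        = \sum_{i=1}^{n} \big( f_j(x_i) - \mathrm{E}_w[f_j \mid C_i] \big)

At the optimum the observed feature counts equal their model expectations, which is exactly the maximum entropy constraint above.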
The same class of models has been discovered repeatedly:
◮ In statistical mechanics (physics), as the Gibbs and Boltzmann distributions
◮ In probability theory, as Maximum Entropy models or log-linear models
◮ In statistics, as logistic regression
◮ In neural networks, as Boltzmann machines
◮ depending on f and C, it might be possible to explicitly calculate Zw(Ci) and Ew[fj|Ci] (a brute-force version is sketched below)
◮ it may be possible to approximate Zw(Ci) and Ew[fj|Ci] when exact calculation is infeasible, especially if …
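When Ci is small enough to enumerate, both quantities are a few lines of code. A brute-force sketch, using the same assumed data layout as the earlier sketches (each candidate represented by its feature vector):

    import math

    # Explicitly compute Zw(C) and Ew[fj | C] for a finite candidate set C.
    def partition_and_expectations(w, C):
        scores = [sum(wj * fj for wj, fj in zip(w, fx)) for fx in C]
        Z = sum(math.exp(s) for s in scores)   # Zw(C)
        p = [math.exp(s) / Z for s in scores]  # pw(x | C)
        E = [sum(p[k] * C[k][j] for k in range(len(C)))  # Ew[fj | C]
             for j in range(len(C[0]))]
        return Z, E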
◮ A feature fj is pseudo-minimal iff for all i = 1, . . . , n and x′ ∈ Ci, fj(xi) ≤ fj(x′)
◮ If fj is pseudo-minimal, then ŵj → −∞ as the likelihood is maximized, so regularization is needed to keep the weights finite (a brute-force test is sketched below)
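The condition itself is easy to test by enumeration on finite candidate sets; a sketch under the same assumed data layout as above:

    # fj is pseudo-minimal iff the correct candidate never has more
    # fj-violations than any competitor, in any training example.
    def is_pseudo_minimal(j, D):
        return all(fxi[j] <= fx[j] for fxi, Ci in D for fx in Ci)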
◮ Ci is a set of 50 parses from the Charniak parser; xi is the best parse in Ci
◮ the Charniak parser’s accuracy ≈ 0.898 (picking the tree it likes best)
◮ oracle accuracy is ≈ 0.968
◮ an EM-like method deals with ties (training data where Ci contains several equally good parses)
◮ features include the Charniak model’s log probability of the tree
◮ trained on parse trees for 36,000 sentences
◮ prior weight α set by cross-validation (its value doesn’t need to be very accurate)
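The “prior weight α” presumably enters through a penalized objective of the usual Gaussian-prior form; the exact variant used in the experiment is not shown on these slides:

    \hat{w} \;=\; \arg\max_w \; \sum_{i=1}^{n} \log p_w(x_i \mid C_i) \;-\; \alpha \sum_j w_j^2

Larger α pulls the weights toward zero, which is what keeps pseudo-minimal features from driving their weights to −∞.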
◮ It is usually not an efficient optimization method
◮ It is a simple and sometimes very efficient method