Unsupervised Context Discrimination and Cluster Stopping
Anagha Kulkarni
Department of Computer Science University of Minnesota, Duluth
July 5, 2006
What is a Context?
For the purpose of this thesis, which deals with written text: A
1. Let us roll this sheet and bind it with a tape.
2. I prefer this brand of tape over any other because it binds the best.
3. As she sang the melodious song he recorded her on the tape.
4. As he moved forward to adjust the volume of the tape playing this loud song…
Example:
1. George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.
2. The Mad‐Max movie made the Australian director, George Miller, a celebrity overnight.
3. George Miller is an acclaimed movie director.
1. “Hi, I'm looking for a program which is able to display 24 bit images. We are using a Sun Sparc equipped with Parallax graphics board running X11. Thanks in advance.”
2. “I currently have some grayscale image files that are not in any standard format. They simply contain the 8‐bit pixel values. I would like to display these images on a PC. The conversion to a GIF format would be helpful.”
3. “I really feel the need for a knowledgeable hockey observer to explain this year's playoffs to me. I mean, the obviously superior Toronto team with the best center and the best goalie in the league keeps losing.”
– To avoid the knowledge acquisition bottleneck
– To keep the method applicable across domains
– To keep the method applicable across languages
– To keep the method applicable across time
be used to represent the given contexts.
data.
Example: Movie Professor Director Psychology Mad‐Max Princeton Australia WordNet
George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.
Example: Movie, Professor, Director, Psychology…
Example: Movie Director, Princeton University…
Example: Director Movie, Princeton University…
Example: tape playing, binding tape…
For bigrams and co‐occurrences:
– OR Mode: Remove if either of the words is a stopword.
– AND Mode: Remove only if both the words are stopwords.
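The two stopword modes can be sketched as follows. This is a minimal illustration, not the SenseClusters implementation: the function name `keep_bigram`, the toy stoplist, and the sample bigrams are all assumptions made for the example.

```python
def keep_bigram(bigram, stopwords, mode="OR"):
    """Return True if the word pair survives stopword filtering."""
    first, second = (word.lower() for word in bigram)
    first_is_stop = first in stopwords
    second_is_stop = second in stopwords
    if mode == "OR":
        # OR mode: remove the pair if either word is a stopword.
        return not (first_is_stop or second_is_stop)
    if mode == "AND":
        # AND mode: remove the pair only if both words are stopwords.
        return not (first_is_stop and second_is_stop)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy stoplist and bigrams, assumed purely for illustration.
STOPWORDS = {"the", "of", "is", "a"}
bigrams = [("movie", "director"), ("the", "movie"), ("of", "the")]

kept_or = [b for b in bigrams if keep_bigram(b, STOPWORDS, "OR")]
kept_and = [b for b in bigrams if keep_bigram(b, STOPWORDS, "AND")]
```

OR mode is the stricter filter: it keeps only pairs made entirely of content words, while AND mode also keeps pairs such as "the movie" in which one word is a stopword.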
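Example:
Context1: George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.
Context2: The Mad‐Max movie made the Australian director, George Miller, a celebrity overnight.

          Movie  Professor  Director  Psychology  Mad‐Max  Princeton  Australian
Context1    0        1         0          1          0         1          0
Context2    1        0         1          0          1         0          1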
First Order Context Representation (Order1)
Context vector: C1 Context vector: C2
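A first-order context vector simply records which of the selected features occur in the context. The sketch below builds binary vectors for the first two George Miller contexts; the helper name `first_order_vector` and the simple regular-expression tokenizer are assumptions for illustration, and the feature list is written with ASCII hyphens.

```python
import re

# Feature list from the example slide (ASCII hyphen used in "Mad-Max").
FEATURES = ["Movie", "Professor", "Director", "Psychology",
            "Mad-Max", "Princeton", "Australian"]


def first_order_vector(context, features):
    """Binary vector: 1 if the feature word occurs in the context, else 0."""
    # Assumed tokenizer: lowercase runs of word characters and hyphens.
    tokens = set(re.findall(r"[\w-]+", context.lower()))
    return [1 if feature.lower() in tokens else 0 for feature in features]


context1 = ("George Miller is an Emeritus Professor of Psychology at the "
            "Princeton University and is often referred to as the father "
            "of the WordNet.")
context2 = ("The Mad-Max movie made the Australian director, George Miller, "
            "a celebrity overnight.")

c1_vec = first_order_vector(context1, FEATURES)
c2_vec = first_order_vector(context2, FEATURES)
```

With these inputs the first context activates Professor, Psychology and Princeton, and the second activates Movie, Director, Mad-Max and Australian, so the two vectors share no features.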
Table: word‐by‐word co‐occurrence matrix. Columns: Director, University, Mad‐Max, Psychology, Industry, … Rows: Movie, Professor, Father, …, Princeton, Film, Australian, Celebrity.
Figure: context vectors C1 and C2; context labels: “acclaimed movie director”, “work film industry”.
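In the second-order representation as it is commonly described, a context vector is obtained by averaging the word vectors (rows of the word-by-word matrix) of the feature words found in the context. The sketch below assumes that averaging scheme; the function name `second_order_vector` and the toy word vectors over the hypothetical columns director, university and film are illustrative values, not data from the slides.

```python
def second_order_vector(context_words, word_vectors):
    """Average the vectors of those context words that have a word vector."""
    rows = [word_vectors[w] for w in context_words if w in word_vectors]
    if not rows:
        return []  # no feature word in this context
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Hypothetical word-by-word rows; columns: director, university, film.
word_vectors = {
    "movie": [1, 0, 1],
    "acclaimed": [1, 0, 0],
    "professor": [0, 1, 0],
}

# "director" has no row in this toy matrix, so it is skipped.
c_vec = second_order_vector(["acclaimed", "movie", "director"], word_vectors)
```

Because the context is described through the company its words keep, two contexts can end up with similar vectors even when they share no words directly.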
Movie Professor Director Psychology Mad‐Max Princeton Australian University
Context1 1 1
Context4 1 1 1
Context5 1 1 1
Context2 1 1 1 1
Context3 1 1
Context6 1 1 1

           d1       d2       d3       d4
Context1   0.7859  ‐0.5961   0.0579  ‐0.3261
Context2   0.7859  ‐0.5961   0.0579  ‐0.3261
Context3   0.3546  ‐0.3662   0.7115   0.7662
Context4   0.5385   0.8373   0.3087  ‐0.1271
Context5   0.7716   0.2139  ‐0.8758   0.4897
Context6   0.5385   0.8373   0.3087  ‐0.1271
Order1 matrix: M1 SVD reduced matrix: M1reduced
d1 d2 d3
Movie ‐0.6360
Professor ‐0.7933 ‐0.8230
Princeton ‐0.9893 0.3663
Mad ‐0.8145
Australian ‐0.6360
Celebrity ‐0.8145
Father ‐0.4403 0.6600

Director University Max Psychology Overnight WordNet
Movie 1
Professor 1 1
Princeton 1 1
Mad 1 1
Australian 1
Celebrity 1 1
Father 1
Order2: Step1: Word‐by‐word matrix: M2 SVD reduced matrix: M2reduced
I2(4)
I2(?) ?
– Null hypothesis: H0: For the given dataset optimal k = 1.
– Alternative hypothesis: H1: For the given dataset optimal k > 1.
– Generate data for the null reference model with expected k = 1.
– Generate a plot (PObserved) of crfun(m) values for the given or
– Generate a plot (PReference) of crfun(m) values for the generated reference data.
– Compare PObserved with PReference and find the largest “gap” between them.
– The first point of maximum gap is the optimal k value!
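The final comparison step can be sketched as below. This is a simplified illustration under stated assumptions: the gap at m is taken to be the plain difference crfun_observed(m) - crfun_reference(m), the name `first_max_gap` is hypothetical, and the criterion values are made-up numbers, not results from the thesis. Generating the reference data is not shown.

```python
def first_max_gap(observed, reference):
    """Return the 1-based m at which observed - reference is first maximal.

    observed[i] and reference[i] hold crfun(m) for m = i + 1.
    """
    if not observed or len(observed) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    gaps = [obs - ref for obs, ref in zip(observed, reference)]
    # list.index returns the first position of the maximum, which matches
    # "the first point of maximum gap".
    return gaps.index(max(gaps)) + 1


# Made-up criterion function values for m = 1..5 (illustration only).
observed_crfun = [10.0, 14.0, 17.5, 18.0, 18.25]
reference_crfun = [10.0, 12.0, 13.5, 14.5, 15.5]

k = first_max_gap(observed_crfun, reference_crfun)
```

Here the gaps are 0, 2, 4, 3.5 and 2.75, so the sketch selects k = 3.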
I2Observed_data(m) I2Reference_data(m)
for DS
C0: Australian Senator Communications Information, Media Release, Minister Communications, Information Technology
C1: Choreographer Artistic Director, Dance Company
selection) to select the top N bigrams.
– 7 datasets
– 6 datasets
– Contents from top 50 (html) pages.
– Traversed one level deep.
– Richard Alston, 2 entities, 247 contexts.
– Sarah Connor, 2 entities, 150 contexts.
– George Miller, 3 entities, 286 contexts.
– Michael Collins, 4 entities, 333 contexts.
– Ted Pedersen, 4 entities, 359 contexts.
– 20,000 USENET postings manually categorized into 20 groups.
– e.g.: comp.graphics and rec.sport.hockey
from two or more groups.
– 7 datasets
– 6 datasets
Figure: F‐measure using Order1 & unigram vs. F‐measure using Order2 & bigrams (panels: NameConflate‐Distinct, NameConflate‐Subtle).
Figure: F‐measure Without SVD vs. F‐measure With SVD (panels: Email‐Distinct, WSD).
Figure: F‐measure using Repeated Bisections vs. F‐measure using Agglomerative (panels: Web, NameConflate‐Subtle).
Figure: Baseline F‐measure vs. F‐measure for all settings (panels: NameConflate‐Distinct, NameConflate‐Subtle).
Figure: Baseline F‐measure vs. F‐measure for all settings (panels: Email‐Distinct, Email‐Subtle).
NameConflate‐Distinct, NameConflate‐Subtle
Perform name disambiguation based on biographical data from the WWW.
Introduce L‐method for cluster‐stopping which is based on fitting lines through evaluation graphs.
Introduce G‐means method for cluster‐stopping which is based on fitting a Gaussian distribution to each cluster.
– http://www.d.umn.edu/~tpederse/tools.html
– http://www.d.umn.edu/~tpederse/data.html
– http://www.d.umn.edu/~tpederse/senseclusters‐pubs.html