

SLIDE 1

Unsupervised Context Discrimination and Cluster Stopping

Anagha Kulkarni

Department of Computer Science University of Minnesota, Duluth

July 5, 2006

SLIDE 2

What is a “Context”?

  • For the purpose of this thesis, which deals with written text:
    – A sentence
    – A paragraph
    – The complete text of a document
  • More generally, any unit of text!

SLIDE 3

What is “Context Discrimination”?

Grouping contexts based on their mutual similarity or dissimilarity.

Example:
1. We had a very hot summer last year.
2. Germany is hosting FIFA 2006.
3. The weather in Duluth is highly dynamic and thus hard to predict.
4. England is out of World Cup 2006!

SLIDE 4

Word Sense Discrimination (WSD)

  • About: Ambiguous words (target or head word).
  • Task: To group the given contexts based on the meaning of the ambiguous word.

Example:
1. Let us roll this sheet and bind it with a tape.
2. I prefer this brand of tape over any other because it binds the best.
3. As she sang the melodious song he recorded her on the tape.
4. As he moved forward to adjust the volume of the tape playing this loud song…

SLIDE 5

Name Discrimination

  • About: People, places, organizations sharing the same name (target or head word).
  • Task: To group the given contexts based on the underlying entity of the ambiguous name.

Example:
1. George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.
2. The Mad-Max movie made the Australian director, George Miller, a celebrity overnight.
3. George Miller is an acclaimed movie director.

SLIDE 6

Email Clustering

  • About: Email grouping.
  • Task: To group the given emails based on the similarity of their contents. Headless clustering!

Example:
1. “Hi, I'm looking for a program which is able to display 24 bit images. We are using a Sun Sparc equipped with Parallax graphics board running X11. Thanks in advance.”
2. “I currently have some grayscale image files that are not in any standard format. They simply contain the 8-bit pixel values. I would like to display these images on a PC. The conversion to a GIF format would be helpful.”
3. “I really feel the need for a knowledgeable hockey observer to explain this year's playoffs to me. I mean, the obviously superior Toronto team with the best center and the best goalie in the league keeps losing.”

SLIDE 7

What is “Unsupervised Context Discrimination”?

Discriminating contexts:

  • Without using any labeled/tagged data.
  • Without using external knowledge resources.
  • Using only what is present in the contexts!
  • Why?
    – To avoid the knowledge acquisition bottleneck
    – To keep the method applicable across domains
    – To keep the method applicable across languages
    – To keep the method applicable across time

SLIDE 8

Approach to WSD by Purandare & Pedersen [2004]

Based on the hypothesis of Contextual Similarity by Miller and Charles (1991): “any two words are semantically similar to the extent that their contexts are similar”

SLIDE 9

Major contributions of this thesis

  • Generalized the Purandare and Pedersen [2004] approach for WSD to the broader problem of Context Discrimination.
  • Introduced three measures for the cluster stopping problem.
  • Introduced a preliminary method of cluster labeling.

SLIDE 10

Methodology: 5 Steps

Step 1: Lexical Feature Extraction → Step 2: Context Representation → Step 3: Predicting k via Cluster Stopping → Step 4: Clustering → Step 5: Cluster Labeling

SLIDE 11

Methodology: Lexical Feature Extraction

Step 1

SLIDE 12

Lexical Features

  • Lexical features: the words or word-pairs of a language that can be used to represent the given contexts.
  • Can be selected from the test data or from separate feature selection data.
  • No external knowledge in any shape or form is used.
  • No syntactic information about the features is used either.

Example features: Movie, Professor, Director, Psychology, Mad-Max, Princeton, Australia, WordNet

“George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.”

SLIDE 13

Types of Lexical Features

  • Unigrams: Single words.
    Example: Movie, Professor, Director, Psychology…
  • Bigrams: Ordered word-pairs.
    Example: Movie Director, Princeton University…
  • Co-occurrences: Unordered word-pairs.
    Example: Director Movie, Princeton University…
  • Target co-occurrences: Unordered word-pairs of which one of the words is the target word.
    Example: tape playing, binding tape…
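The four feature types can be sketched in Python (a hypothetical helper, not the thesis's SenseClusters code; co-occurrences are simplified here to any pair of distinct words in the context rather than pairs within a window):

```python
from itertools import combinations

def extract_features(tokens, target=None):
    """Illustrative extractor for the four lexical feature types."""
    unigrams = set(tokens)
    # Bigrams: ordered adjacent word-pairs.
    bigrams = set(zip(tokens, tokens[1:]))
    # Co-occurrences: unordered word-pairs (simplified: any pair of
    # distinct words in the context).
    cooc = {frozenset(p) for p in combinations(sorted(set(tokens)), 2)}
    # Target co-occurrences: unordered pairs containing the target word.
    target_cooc = {p for p in cooc if target in p} if target else set()
    return unigrams, bigrams, cooc, target_cooc

uni, bi, co, tco = extract_features(
    "he recorded her on the tape".split(), target="tape")
```

For the tape example above, `bi` contains the ordered pair `("on", "the")`, while `tco` contains unordered pairs such as {"recorded", "tape"}.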

SLIDE 14

Feature Filtering Techniques

  • Frequency cutoff: Remove features occurring fewer than X times, to remove rare features.
  • Stoplisting: Remove function words such as “the”, “of”, “in”, “a”, “an” etc.
    For bigrams and co-occurrences:
    – OR mode: Remove if either of the words is a stopword.
    – AND mode: Remove only if both of the words are stopwords.
  • Statistical tests of association (for bigrams and co-occurrences): Check whether the two words in a word-pair occur together merely by chance or are truly related.
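The two stoplisting modes can be sketched as follows (the stopword set is a small illustrative sample, not the full stoplist):

```python
STOPWORDS = {"the", "of", "in", "a", "an"}

def keep_pair(pair, mode="OR"):
    """Return True if a bigram/co-occurrence survives stoplisting.
    OR mode: drop the pair if either word is a stopword.
    AND mode: drop the pair only if both words are stopwords."""
    w1, w2 = (w.lower() for w in pair)
    if mode == "OR":
        return w1 not in STOPWORDS and w2 not in STOPWORDS
    return not (w1 in STOPWORDS and w2 in STOPWORDS)

keep_pair(("Princeton", "University"))   # survives both modes
keep_pair(("father", "of"), mode="OR")   # dropped in OR mode...
keep_pair(("father", "of"), mode="AND")  # ...but kept in AND mode
```

OR mode is the stricter filter: a content word paired with a function word (e.g. "father of") survives only in AND mode.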

SLIDE 15

Methodology: Context Representation

Step 2

SLIDE 16

Context Representation

The task of translating each textual context into a format that a computer can understand.

Example:

  • Context 1: George Miller is an Emeritus Professor of Psychology at the Princeton University and is often referred to as the father of the WordNet.
  • Context 2: The Mad-Max movie made the Australian director, George Miller, a celebrity overnight.

            Movie  Professor  Director  Psychology  Mad-Max  Princeton  Australian
  Context1    0        1          0          1         0         1          0
  Context2    1        0          1          0         1         0          1

First Order Context Representation (Order1): the rows above are the context vectors C1 and C2.
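Building Order1 vectors amounts to one exact-match test per feature. A minimal sketch over lower-cased, whitespace-tokenized toy contexts:

```python
def order1_vectors(contexts, features):
    """First-order (Order1) representation: each context becomes a
    binary vector recording which features occur in it (exact match)."""
    return [[1 if f in set(ctx.lower().split()) else 0 for f in features]
            for ctx in contexts]

contexts = [
    "george miller is a professor of psychology at princeton",
    "the mad-max movie made director george miller a celebrity",
]
features = ["movie", "professor", "director", "psychology"]
vecs = order1_vectors(contexts, features)
# vecs[0] -> [0, 1, 0, 1]; vecs[1] -> [1, 0, 1, 0]
```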

SLIDE 17

Second Order Context Representation (Order2)

Tries to go beyond the “exact match” strategy of Order1 by capturing indirect relationships.

Example

1. George Miller is an acclaimed movie director.
2. George Miller has since continued his work in the film industry.
3. Film director George Miller in the news for “Mad-Max”.

SLIDE 18

Order2: Step1: Creating the word‐by‐word matrix

[Word-by-word matrix: rows are words such as Movie, Professor, Father, Princeton, Film, Australian and Celebrity; columns are co-occurring words such as Director, University, Mad-Max, Psychology and Industry; a cell holds 1 when the row word and the column word co-occur.]

SLIDE 19

Order2: Step2: Creating the context vectors

  • George Miller is an acclaimed movie director.
  • George Miller has since continued his work in the film industry.

[Context vector C1 is built from the word vectors of “acclaimed”, “movie” and “director”; context vector C2 from those of “work”, “film” and “industry”.]
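Step 2 reduces to averaging the word-by-word matrix rows of the context's words. A sketch with made-up three-dimensional word vectors (not the thesis's actual values):

```python
# Toy rows of a word-by-word co-occurrence matrix (made-up values).
word_vectors = {
    "movie":    [1.0, 0.0, 1.0],
    "director": [1.0, 1.0, 0.0],
    "film":     [0.0, 1.0, 1.0],
    "industry": [1.0, 1.0, 1.0],
}

def order2_vector(tokens):
    """Second-order (Order2) context vector: the average of the
    co-occurrence vectors of the context's words; words with no row
    in the matrix are skipped."""
    rows = [word_vectors[t] for t in tokens if t in word_vectors]
    return [sum(col) / len(rows) for col in zip(*rows)]

c1 = order2_vector("george miller is an acclaimed movie director".split())
# c1 averages the "movie" and "director" rows -> [1.0, 0.5, 0.5]
```

Because two contexts can share dimensions through co-occurring words they never share directly, Order2 captures the indirect relationships that Order1's exact match misses.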

SLIDE 20

Singular Value Decomposition (SVD)

[Order1 context-by-feature matrix M1: six contexts over the features Movie, Professor, Director, Psychology, Mad-Max, Princeton, Australian and University. SVD reduces M1 to M1reduced, in which each context is a vector over the reduced dimensions d1–d4.]

SLIDE 21

SVD (cont.)

[Order2 word-by-word matrix M2: rows for Movie, Professor, Princeton, Mad, Australian, Celebrity and Father over the columns Director, University, Max, Psychology, Overnight and WordNet. SVD reduces M2 to M2reduced, mapping each word onto the reduced dimensions d1–d3.]
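A truncated SVD of a small context-by-feature matrix can be computed with NumPy (the matrix values below are illustrative, not the slides' data):

```python
import numpy as np

# A binary context-by-feature matrix (rows = contexts), analogous to M1.
M = np.array([
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 0],
], dtype=float)

# Full SVD, then keep only the top-k singular dimensions.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_reduced = U[:, :k] * S[:k]   # each context as a k-dimensional vector
```

Keeping only the largest singular values smooths the sparse binary matrix; as the conclusions later note, this did not generally help these methods.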

SLIDE 22

Methodology: Predicting k via Cluster Stopping

Step3

SLIDE 23

Building blocks of Cluster Stopping

  • Criterion function (crfun): a metric that clustering algorithms use to assess and optimize the quality of the generated clusters.
  • Types:
    – Internal: Maximize within-cluster similarity (I1, I2)
    – External: Minimize between-cluster similarity (E1)
    – Hybrid: Internal + External (H1, H2)
  • Cluster a dataset iteratively into m clusters and record the crfun(m) values…

SLIDE 24

Contrived dataset: #contexts = 80, expected k = 4

[Plot of I2(m) against m: the curve climbs steeply up to the knee at m = 4, marked I2(4), and flattens beyond it.]

SLIDE 25

Real dataset: #contexts = 900, expected k = 4 (DS)

[Plot of I2(m) against m for DS: the knee is not visually obvious, so the optimal point I2(?) is unclear.]

SLIDE 26

Cluster Stopping Measures

  • Based on the criterion functions.
  • Do not require any form of user input such as setting a threshold value.
  • 3 measures:
    – PK2
    – PK3
    – Adapted Gap Statistic

SLIDE 27

PK2(m) = crfun(m) / crfun(m − 1)

[Plot of PK2(m) against m for DS.]

SLIDE 28

PK3(m) = 2 · crfun(m) / (crfun(m − 1) + crfun(m + 1))

[Plot of PK3(m) against m for DS.]
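Both measures are computed directly from the recorded crfun(m) values. The sketch below uses made-up crfun values and, as a simplification of the thesis's selection rules, takes the peak of PK3 as the predicted k:

```python
def pk2(crfun):
    """PK2(m) = crfun(m) / crfun(m-1): relative gain of adding a cluster."""
    return {m: crfun[m] / crfun[m - 1] for m in range(2, len(crfun))}

def pk3(crfun):
    """PK3(m) = 2*crfun(m) / (crfun(m-1) + crfun(m+1)): crfun(m) relative
    to its neighbors; a spike marks the knee of the crfun curve."""
    return {m: 2 * crfun[m] / (crfun[m - 1] + crfun[m + 1])
            for m in range(2, len(crfun) - 1)}

# Toy I2-style crfun values indexed by m (index 0 unused); the curve
# rises steeply until m = 4, then flattens.
crfun = [None, 10.0, 18.0, 25.0, 30.0, 30.5, 30.8, 31.0]
scores = pk3(crfun)
k_hat = max(scores, key=scores.get)   # simplified rule: take the peak
```

With this toy curve the PK3 spike falls at m = 4, the point where adding a cluster stops improving the criterion function.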

SLIDE 29

Adapted Gap Statistic

  • Based on the Gap Statistic by Tibshirani et al. (2001)
  • The main idea:
    – Null hypothesis H0: for the given dataset, the optimal k = 1.
    – Alternative hypothesis H1: for the given dataset, the optimal k > 1.
  • Algorithm:
    – Generate data for the null reference model with expected k = 1.
    – Generate a plot (PObserved) of crfun(m) values for the given (observed) data.
    – Generate a plot (PReference) of crfun(m) values for the generated reference data.
    – Compare PObserved with PReference and find the largest “gap” between them.
    – The first point of maximum gap is the optimal k value!
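The comparison step can be sketched as follows; generating the k = 1 reference data and the original Gap Statistic's log/expectation machinery are omitted, and the crfun values are made up:

```python
def adapted_gap(crfun_obs, crfun_ref):
    """Gap(m) = distance between the reference and observed crfun
    curves; return the first m at which the gap is largest (sketch)."""
    gaps = {m: abs(r - o)
            for m, (o, r) in enumerate(zip(crfun_obs, crfun_ref), start=1)}
    best = max(gaps.values())
    return min(m for m, g in gaps.items() if g == best)

# Observed crfun rises until m = 4 and flattens; the k = 1 reference
# data yields a slowly, steadily rising curve.
observed  = [10.0, 18.0, 25.0, 30.0, 30.5, 31.0]
reference = [ 9.0, 12.0, 15.0, 18.0, 21.0, 24.0]
k_hat = adapted_gap(observed, reference)
```

The observed curve pulls furthest away from the reference curve exactly where the real cluster structure is exhausted, so the maximum gap lands at the knee.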

SLIDE 30

Adapted Gap Statistic

[Plot of I2Observed_data(m) and I2Reference_data(m) against m, for DS.]

SLIDE 31

Adapted Gap Statistic (cont.)

[Plot of Gap(m) against m for DS.]

SLIDE 32

Methodology: Clustering

Step 4

SLIDE 33

Clustering

  • One of the primary methods of unsupervised learning.
  • We support 3 types of clustering algorithms:
    – Hierarchical (e.g. Agglomerative)
    – Partitional (e.g. K-means)
    – Hybrid (e.g. Repeated Bisections)
  • Aim: To appropriately group the given set of context vectors into k clusters.

SLIDE 34

Methodology: Cluster Labeling

Step 5

SLIDE 35

Cluster Labeling

  • Aim: To identify the underlying entity for each cluster.
  • Descriptive labels: top N bigrams of that cluster.
  • Discriminating labels: top N bigrams unique to that cluster.
  • Frequency or statistical tests of association (as in feature selection) can be used to select the top N bigrams.

Cluster labels for the ambiguous name Richard Alston:

  C0: Australian Senator, Communications Information, Media Release, Minister Communications, Information Technology
  C1: Choreographer Artistic Director, Dance Company
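A frequency-based sketch of the two label types (the bigram counts are hypothetical; tests of association could rank the candidates instead):

```python
from collections import Counter

def label_clusters(cluster_bigrams, n=2):
    """Descriptive labels: the top-n most frequent bigrams in a cluster.
    Discriminating labels: the top-n bigrams unseen in other clusters."""
    descriptive, discriminating = {}, {}
    for cid, bigrams in cluster_bigrams.items():
        counts = Counter(bigrams)
        descriptive[cid] = [b for b, _ in counts.most_common(n)]
        others = {b for k, v in cluster_bigrams.items()
                  if k != cid for b in v}
        unique = Counter({b: c for b, c in counts.items()
                          if b not in others})
        discriminating[cid] = [b for b, _ in unique.most_common(n)]
    return descriptive, discriminating

clusters = {  # hypothetical bigram lists for two clusters
    "C0": [("media", "release")] * 3 + [("information", "technology")],
    "C1": [("dance", "company")] * 2 + [("media", "release")],
}
desc, disc = label_clusters(clusters, n=1)
```

A frequent bigram shared by several clusters can be a good descriptive label yet a useless discriminating one, which is why the two label types are kept separate.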

SLIDE 36

Experimental Data – 4 genres

SLIDE 37

NameConflate genre

  • Name discrimination data.
  • Source: The New York Times archives (Jan ’02 to Dec ’04).
  • Method: Creating pseudo-ambiguity by conflation.
  • Multi-dimensional ambiguity: 2, 3, 4, 5 or 6 names.
  • Distinct (e.g. “Bill Gates” & “Jason Kidd”): 7 datasets
  • Subtle (e.g. “Bill Gates” & “Steve Jobs”): 6 datasets

SLIDE 38

Web genre

  • Name discrimination data.
  • Source: The World Wide Web, using the Google search engine.
    – Contents from the top 50 (HTML) pages.
    – Traversed one level deep.
  • Method: Manually cleaned and annotated.
  • Name variations: “Mr. Miller”, “Dr. Miller”, “G. Miller”…
  • 5 datasets:
    – Richard Alston, 2 entities, 247 contexts
    – Sarah Connor, 2 entities, 150 contexts
    – George Miller, 3 entities, 286 contexts
    – Michael Collins, 4 entities, 333 contexts
    – Ted Pedersen, 4 entities, 359 contexts

SLIDE 39

Email genre

  • Email clustering data.
  • Source: 20 Newsgroups dataset.
    – 20,000 USENET postings manually categorized into 20 groups.
    – e.g. comp.graphics and rec.sport.hockey
  • Method: Creating artificial mixing of contexts by combining postings from two or more groups.
  • Multi-dimensional ambiguity: Conflated 2, 3 or 4 groups.
  • Distinct (e.g. “sci.electronics” & “soc.religion.christian”): 7 datasets
  • Subtle (e.g. “sci.crypt” & “sci.electronics”): 6 datasets

SLIDE 40

WSD genre

  • Word Sense Discrimination data.
  • Datasets for 4 ambiguous words: “hard”, “serve”, “line” and “interest”.
  • Source: The cleaned and SENSEVAL2-formatted versions of these datasets distributed by Dr. Ted Pedersen.
SLIDE 41

Experiments

SLIDE 42

Experimental Results

SLIDE 43

Order1 and unigrams vs. Order2 and bigrams

[Charts: F-measure using Order1 & unigrams vs. F-measure using Order2 & bigrams, for NameConflate-Distinct and NameConflate-Subtle.]

SLIDE 44

Without SVD vs. With SVD

[Charts: F-measure without SVD vs. F-measure with SVD, for Email-Distinct and WSD.]

SLIDE 45

Repeated Bisections vs. Agglomerative Clustering

[Charts: F-measure using Repeated Bisections vs. F-measure using Agglomerative clustering, for Web and NameConflate-Subtle.]

SLIDE 46

NameConflate: Distinct vs. Subtle

[Charts: F-measure for all settings against the baseline F-measure, for NameConflate-Distinct and NameConflate-Subtle.]

SLIDE 47

Email: Distinct vs. Subtle

[Charts: F-measure for all settings against the baseline F-measure, for Email-Distinct and Email-Subtle.]

SLIDE 48

Cluster Stopping Results

SLIDE 49

NameConflate: k predictions

[Results: predicted k values for NameConflate-Distinct and NameConflate-Subtle.]

SLIDE 50

Web: k predictions

SLIDE 51

Email: k predictions

[Results: predicted k values for Email-Distinct and Email-Subtle.]

SLIDE 52

WSD: k predictions

SLIDE 53

Conclusions

  • Generalized the Purandare and Pedersen [2004] approach for WSD to:
    – Name Discrimination (headed clustering)
    – Email Clustering (headless clustering)
    – Thus, in general, to “Context Discrimination”
  • Proposed and experimented with 3 cluster stopping measures.
  • PK3 exhibits maximum agreement with the given number of clusters.

SLIDE 54

Conclusions (cont.)

  • Order1 and Order2 provide a complementary pair of context representations.
  • Applying SVD generally does not help our methods.
  • Performance of the repeated bisections clustering algorithm is generally comparable with agglomerative clustering, except on the subtle type of datasets.
  • We also find that our methods are better equipped to deal with “distinct” type datasets than with “subtle” type datasets.

SLIDE 55

Related Work

  • Mann and Yarowsky, CoNLL 2003: Perform name disambiguation based on biographical data from the WWW.
  • Salvador and Chan, IEEE-ICTAI 2004: Introduce the L-method for cluster stopping, based on fitting lines through evaluation graphs.
  • Hamerly and Elkan, NIPS 2003: Introduce the G-means method for cluster stopping, based on fitting a Gaussian distribution to each cluster.

SLIDE 56

Future Work

  • Comparison with Latent Semantic Analysis (LSA)
  • Improving the quality of automatically generated cluster labels
  • Developing ensembles of cluster stopping methods
  • Exploring the effect of automatically generated stoplists

SLIDE 57

Links

  • SenseClusters
    – Project: http://senseclusters.sourceforge.net/
    – Web-interface: http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi
  • NameConflate and other data generation utilities
    – http://www.d.umn.edu/~tpederse/tools.html
  • Data and Publications
    – http://www.d.umn.edu/~tpederse/data.html
    – http://www.d.umn.edu/~tpederse/senseclusters-pubs.html