

SLIDE 1

Introduction Methodology Evaluation Results Conclusion

A Study of Hybrid Similarity Measures for Semantic Relation Extraction

Alexander Panchenko and Olga Morozova, Center for Natural Language Processing (CENTAL), Université catholique de Louvain, Belgium

alexander.panchenko@student.uclouvain.be olga.morozova@uclouvain.be

April 23, 2012

SLIDE 2

Plan

Introduction
Methodology
Evaluation
Results
Conclusion

SLIDE 3

Semantic Relations

  • Semantic relations are useful for NLP and IR applications:
  • Query expansion (Hsu et al., 2006)
  • QA systems (Sun et al., 2005)
  • Text categorization (Tikk et al., 2003)
  • Word Sense Disambiguation (Patwardhan et al., 2003)

SLIDE 4

Semantic Relations

  • Semantic relations are useful for NLP and IR applications:
  • Query expansion (Hsu et al., 2006)
  • QA systems (Sun et al., 2005)
  • Text categorization (Tikk et al., 2003)
  • Word Sense Disambiguation (Patwardhan et al., 2003)
  • Semantic resources: thesauri, ontologies, synonymy dictionaries, WordNets, . . .
  • In the context of this work we consider the following relation types:
  • synonyms: ⟨car, SYN, vehicle⟩, ⟨animal, SYN, beast⟩
  • hypernyms: ⟨car, HYPER, Jeep Cherokee⟩, ⟨animal, HYPER, crocodile⟩
  • co-hyponyms (terms that share a common hypernym): ⟨Toyota Land Cruiser, COHYPER, Jeep Cherokee⟩

SLIDE 5

Motivation

Figure: Semantic relations in the EuroVoc thesaurus.

SLIDE 6

Motivation

Figure: Semantic relations in the EuroVoc thesaurus.

  • Manual construction of relations:
  • (+) Precision
  • (–) Very expensive and time-consuming
  • Existing relation extraction methods:
  • (+) No or very little manual labor
  • (–) Not as precise as manual construction

SLIDE 7

Motivation

Figure: Semantic relations in the EuroVoc thesaurus.

  • Manual construction of relations:
  • (+) Precision
  • (–) Very expensive and time-consuming
  • Existing relation extraction methods:
  • (+) No or very little manual labor
  • (–) Not as precise as manual construction
  • ⇒ development of new relation extraction methods:
  • Input: terms C
  • Output: semantic relations R̂ ⊂ C × C

SLIDE 8

The State of the Art

  • A multitude of complementary measures has been proposed to extract synonyms, hypernyms, and co-hyponyms

  • Most of them are based on one of the 5 key approaches:
  • 1. distributional analysis (Lin, 1998b)
  • 2. Web as a corpus (Cilibrasi and Vitanyi, 2007)
  • 3. lexico-syntactic patterns (Bollegala et al., 2007)
  • 4. semantic networks (Resnik, 1995)
  • 5. definitions of dictionaries or encyclopedias (Zesch et al., 2008a)

SLIDE 9

The State of the Art

  • A multitude of complementary measures has been proposed to extract synonyms, hypernyms, and co-hyponyms

  • Most of them are based on one of the 5 key approaches:
  • 1. distributional analysis (Lin, 1998b)
  • 2. Web as a corpus (Cilibrasi and Vitanyi, 2007)
  • 3. lexico-syntactic patterns (Bollegala et al., 2007)
  • 4. semantic networks (Resnik, 1995)
  • 5. definitions of dictionaries or encyclopedias (Zesch et al., 2008a)
  • Some attempts have been made to combine measures (Curran, 2002; Cederberg and Widdows, 2003; Mihalcea et al., 2006; Agirre et al., 2009; Yang and Callan, 2009)
  • However, most studies still do not take all 5 extraction approaches into account.

SLIDE 10

Contributions

  • A systematic analysis of:
  • 16 baseline similarity measures based on the 5 key extraction principles
  • their combinations with 8 fusion methods and 3 techniques for combination set selection
  • We are the first to propose hybrid similarity measures based on all 5 key extraction approaches:
  • 1. distributional analysis
  • 2. Web as a corpus
  • 3. lexico-syntactic patterns
  • 4. semantic networks
  • 5. definitions of dictionaries or encyclopedias
  • The best hybrid measure found combines 15 baseline measures with Logistic Regression

SLIDE 11

Plan

Introduction
Methodology
    Similarity-based Relation Extraction
    Single Similarity Measures
    Hybrid Similarity Measures
Evaluation
Results
Conclusion

SLIDE 12

Similarity-based Relation Extraction

[Diagram: terms C → sim_1 … sim_N → similarity matrices S_1 … S_N → norm → (hybrid only: combination method → S_cmb) → knn → relations R̂]

Figure: Single (a) and hybrid (b) similarity-based relation extractors.

  • sim_k – a similarity measure, sim_k(c_i, c_j) ∈ [0; 1], c_i, c_j ∈ C
  • S_i – a term-term similarity matrix (C × C)
  • knn – k-NN thresholding: R̂ = ⋃_{i=1}^{|C|} {⟨c_i, c_j⟩ : (c_j ∈ top k% of c_i) ∧ (s_ij > 0)}
  • S_cmb – a combined similarity matrix obtained with combination_method(S_1, . . . , S_N)
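The knn thresholding step can be sketched as follows (a toy illustration: the 3×3 matrix and the 34% threshold are made up, and the argsort-based ranking resolves ties arbitrarily):

```python
import numpy as np

def knn_threshold(S, k_percent):
    """k-NN thresholding sketch: for every term c_i, keep a relation to
    each of its top k% most similar terms c_j, provided s_ij > 0."""
    n = S.shape[0]
    k = max(1, int(round(n * k_percent / 100.0)))
    relations = set()
    for i in range(n):
        row = S[i].astype(float).copy()
        row[i] = -np.inf                      # never relate a term to itself
        for j in np.argsort(row)[::-1][:k]:   # k nearest neighbours of c_i
            if row[j] > 0:
                relations.add((i, int(j)))
    return relations

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.0],
              [0.1, 0.0, 1.0]])
```

Note the resulting relation set is asymmetric by construction: c_j may be among the nearest neighbours of c_i without the converse holding.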

SLIDE 13

Single and Hybrid Similarity Measures

  • 16 single measures
  • 5 measures based on a semantic network
  • 3 web-based measures
  • 5 corpus-based measures
  • 2 distributional
  • 1 based on lexico-syntactic patterns
  • 2 other co-occurrence-based
  • 3 definition-based measures
  • 64 hybrid measures
  • 8 combination methods
  • 8 measure sets obtained with 3 measure selection techniques

SLIDE 14

Plan

Introduction
Methodology
    Similarity-based Relation Extraction
    Single Similarity Measures
    Hybrid Similarity Measures
Evaluation
Results
Conclusion

SLIDE 15

Measures Based on a Semantic Network

  • 1. Wu and Palmer (1994)
  • 2. Leacock and Chodorow (1998)
  • 3. Resnik (1995)
  • 4. Jiang and Conrath (1997)
  • 5. Lin (1998)

Data:

  • WordNet 3.0
  • SemCor corpus

Variables:

  • Lengths of the shortest paths between terms in the network
  • Probability of terms derived from a corpus

Coverage: 155,287 English terms encoded in WordNet 3.0.
Complexity: calculation of shortest path(s) between the nodes.

SLIDE 16

Web-based Measures

Normalized Google Distance (NGD) (Cilibrasi and Vitanyi, 2007)

  • 6. NGD-Yahoo!
  • 7. NGD-Bing
  • 8. NGD-Google over wikipedia.org domain

Data: the number of times the terms co-occur in documents indexed by an IR system.
Variables:

  • number of hits returned by the query "c_i"
  • number of hits returned by the query "c_i AND c_j"

Coverage: huge vocabulary in dozens of languages.
Complexity: constraints of a search engine API.
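From those hit counts, NGD can be sketched in a few lines (the hit counts and index size below are placeholders; real use is bounded by the search engine API):

```python
import math

def ngd(hits_x, hits_y, hits_xy, index_size):
    """Normalized Google Distance (Cilibrasi and Vitanyi, 2007) computed
    from raw hit counts; 0 means maximally related, larger means less
    related. index_size is the (estimated) number of indexed pages."""
    fx, fy, fxy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    n = math.log(index_size)
    return (max(fx, fy) - fxy) / (n - min(fx, fy))
```

A similarity score in [0; 1] can then be obtained by a monotone transform such as exp(-NGD).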

SLIDE 17

Corpus-based Measures

  • 9. Bag-of-word Distributional Analysis (BDA) (Sahlgren, 2006)
  • 10. Syntactic Distributional Analysis (SDA) (Curran, 2003)

Data: the WaCkypedia (800M tokens) and PukWaC (2000M tokens) corpora (Baroni et al., 2009)
Variables:

  • a feature vector based on the context window
  • a feature vector based on the syntactic context

Coverage: the word should occur in the corpora.
Complexity: O(BDA) ≪ O(SDA), because of dependency parsing.

SLIDE 18

Corpus-based Measures

  • 11. A measure based on lexico-syntactic patterns (PatternWiki)

Data: WaCkypedia corpus (800M tokens)
Method:

  • 10 patterns for hypernymy extraction: 6 Hearst (1992) patterns + 4 other patterns
  • such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]} [PATTERN=1]
  • Semantic similarity s_ij between terms c_i, c_j ∈ C is a function of the number of term co-occurrences n_ij in the same concordance: sim(c_i, c_j) = s_ij = n_ij / max_ij(n_ij).
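The normalization step amounts to one line of code (the pattern co-occurrence counts below are invented for illustration):

```python
def pattern_similarity(cooccurrence_counts):
    """Turn pattern co-occurrence counts n_ij into similarity scores
    s_ij = n_ij / max_ij(n_ij), so all scores fall in [0, 1]."""
    n_max = max(cooccurrence_counts.values())
    return {pair: n / n_max for pair, n in cooccurrence_counts.items()}

counts = {("occupation", "doctor"): 4, ("occupation", "engineer"): 2}
```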

SLIDE 19

Corpus-based Measures

Figure: A UNITEX graph implementing the first extraction pattern.

Coverage: the target terms c_i, c_j should co-occur in a sentence.
Complexity: application of a cascade of FSTs to a text.

SLIDE 20

Corpus-based Measures

  • 12. Latent Semantic Analysis (LSA) on the TASA corpus (Landauer and Dumais, 1997)
  • 13. NGD on the Factiva corpus (Veksler et al., 2008)

SLIDE 21

Definition-based Measures

  • 14. Extended Lesk (Banerjee and Pedersen, 2003)
  • 15. GlossVectors (Patwardhan and Pedersen, 2006)

Data: WordNet glosses.
Variables:

  • a bag-of-words vector of a term c_i derived from the glosses
  • relations between words (c_i, c_j) in the network

Coverage: 117,659 glosses encoded in WordNet 3.0.
Complexity: calculation of a similarity in a vector space.

SLIDE 22

Definition-based Measures

  • 16. WktWiki – BDA on definitions from Wiktionary and Wikipedia¹

Data: Wikipedia abstracts, Wiktionary.
Method:

  • Definition = abstract of the Wikipedia article with title "c_i" + glosses, examples, quotations, related words, and categories from Wiktionary for c_i
  • Represent a definition as a bag-of-words vector
  • Calculate similarities with the cosine
  • Update similarities according to relations in the Wiktionary.

Coverage: Wiktionary: 536,594 glosses; Wikipedia: 3.8M articles.
Complexity: cosine calculation in a bag-of-words space.
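The bag-of-words cosine step can be sketched as below (raw term frequencies on whitespace tokens; a real implementation would also weight and filter terms):

```python
from collections import Counter
import math

def cosine_similarity(definition_a, definition_b):
    """Cosine between two definitions represented as bag-of-words
    vectors of raw term frequencies."""
    a = Counter(definition_a.lower().split())
    b = Counter(definition_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```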

¹The method stems from the work of Zesch et al. (2008)

SLIDE 23

Plan

Introduction
Methodology
    Similarity-based Relation Extraction
    Single Similarity Measures
    Hybrid Similarity Measures
Evaluation
Results
Conclusion

SLIDE 24

Combination Methods

  • The goal of a combination method is to produce "better" similarity scores than the scores of the single measures.
  • A combination method takes as input {S_1, . . . , S_K} produced by K single measures and outputs S_cmb.
  • s_ij^k ∈ S_k is the pairwise similarity score of terms c_i and c_j produced by the k-th measure.
  • We tested 8 combination methods.

SLIDE 25

Combination Methods

  • 1. Mean. A mean of the K pairwise similarity scores:
    S_cmb = (1/K) Σ_{k=1}^{K} S_k  ⇔  s_ij^cmb = (1/K) Σ_{k=1,K} s_ij^k
  • 2. Mean-Nnz. A mean of the scores having a non-zero value:
    s_ij^cmb = (1 / |{k : s_ij^k > 0, k = 1, K}|) Σ_{k=1,K} s_ij^k
  • 3. Mean-Zscore. A mean of the scores transformed into Z-scores:
    S_cmb = (1/K) Σ_{k=1}^{K} (S_k − µ_k) / σ_k,
    where µ_k and σ_k are the mean and standard deviation of the scores of the k-th measure (S_k).
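The first three combination methods, sketched with NumPy (toy matrices; Mean-Zscore assumes each measure has non-constant scores, otherwise σ_k = 0):

```python
import numpy as np

def combine_mean(mats):
    """Mean: average K similarity matrices elementwise."""
    return np.mean(mats, axis=0)

def combine_mean_nnz(mats):
    """Mean-Nnz: average only the non-zero scores for each pair."""
    stack = np.stack(mats)
    nnz = np.maximum((stack > 0).sum(axis=0), 1)  # avoid division by zero
    return stack.sum(axis=0) / nnz

def combine_mean_zscore(mats):
    """Mean-Zscore: standardize each measure's scores, then average."""
    stack = np.stack(mats)
    mu = stack.mean(axis=(1, 2), keepdims=True)
    sd = stack.std(axis=(1, 2), keepdims=True)
    return ((stack - mu) / sd).mean(axis=0)
```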

SLIDE 26

Combination Methods

  • 4. Median. A median of the K pairwise similarities:
    s_ij^cmb = median(s_ij^1, . . . , s_ij^K)
  • 5. Max. A maximum of the K pairwise similarities:
    s_ij^cmb = max(s_ij^1, . . . , s_ij^K)
  • 6. RankFusion. A mean of the scores converted to ranks:
    s_ij^cmb = (1/K) Σ_{k=1,K} r_ij^k,
    where r_ij^k is the rank corresponding to the similarity score s_ij^k.
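Methods 4–6 in the same style (toy matrices; the rank step here resolves tied scores by position rather than averaging their ranks):

```python
import numpy as np

def combine_median(mats):
    """Median: elementwise median of K similarity matrices."""
    return np.median(mats, axis=0)

def combine_max(mats):
    """Max: elementwise maximum of K similarity matrices."""
    return np.max(mats, axis=0)

def combine_rank_fusion(mats):
    """RankFusion: replace each measure's scores by their ranks
    (higher similarity -> higher rank), then average the ranks."""
    ranked = []
    for m in np.stack(mats):
        flat = m.ravel()
        ranks = flat.argsort().argsort() + 1  # rank 1 = smallest score
        ranked.append(ranks.reshape(m.shape))
    return np.mean(ranked, axis=0)
```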

SLIDE 27

Combination Methods

  • 7. RelationFusion. Unions the best relations found by each measure separately; a relation extracted independently by several measures gets more weight.

Input: similarity matrices of N single measures {S_1, . . . , S_N}, number of nearest neighbors k
Output: combined similarity matrix S_cmb

for i = 1, N do
    R̂_i = knn(S_i, k);
    R_i = relation_matrix(R̂_i);
end
S_cmb = (1/N) Σ_{i=1}^{N} R_i;
return S_cmb;

where r_ij^k ∈ R_k, with r_ij^k = 1 if the relation ⟨c_i, c_j⟩ ∈ R̂_k, else 0.
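A sketch of RelationFusion: each measure's top-k neighbours become a binary relation matrix, and the matrices are averaged so relations found by several measures score higher (the matrices and k below are toy values):

```python
import numpy as np

def relation_fusion(mats, k):
    """Average binary top-k relation matrices over N measures; an entry
    of 1.0 means every measure extracted that relation."""
    n = mats[0].shape[0]
    combined = np.zeros((n, n))
    for S in mats:
        R = np.zeros((n, n))
        for i in range(n):
            row = S[i].astype(float).copy()
            row[i] = -np.inf                  # ignore self-similarity
            for j in np.argsort(row)[::-1][:k]:
                if row[j] > 0:
                    R[i, j] = 1.0             # binary relation matrix
        combined += R
    return combined / len(mats)
```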

SLIDE 28

Combination Methods

  • 8. Logit. A supervised combination of similarity measures:
  • Training a binary classifier (a Logistic Regression) on a set of manually constructed semantic relations R (BLESS or SN)
  • Positive training examples are "meaningful" relations (synonyms, hyponyms, co-hyponyms, associations)
  • Negative training examples are pairs of semantically unrelated words (generated randomly and verified manually)
  • A relation ⟨c_i, t, c_j⟩ ∈ R is represented with an N-dimensional vector of pairwise similarities: x_ij = (s_ij^1, . . . , s_ij^N)
  • Category y_ij: y_ij = 0 if ⟨c_i, t, c_j⟩ is a random relation, 1 otherwise
  • Using the trained model (w_0, w_1, . . . , w_K) to combine the measures:
    s_ij^cmb = 1 / (1 + e^(−z)),  z = w_0 + Σ_{k=1}^{K} w_k · s_ij^k
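The final combination step is then a plain logistic function over the K scores (the weights below are placeholders, not the trained model from the study):

```python
import numpy as np

def logit_combine(x, w0, w):
    """Combine the K similarity scores in each row of x with
    logistic-regression weights: s_cmb = 1 / (1 + exp(-(w0 + w . x))).
    In the study the weights are fitted on BLESS or SN relations."""
    z = w0 + np.asarray(x) @ np.asarray(w)
    return 1.0 / (1.0 + np.exp(-z))
```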

SLIDE 29

Combination Sets

A problem

The number of ways to choose which of the 16 single measures to combine with a given combination method:

Σ_{m=2}^{16} C(16, m) = Σ_{m=2}^{16} 16! / (m! (16 − m)!) = 65,519

  • Expert choice of measures – 5, 9, and 15 measures
  • Forward Stepwise Procedure – 7, 8a, 8b, 10 measures
  • Analysis of the Logistic Regression weights trained on all 16 measures – 12 measures
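A quick check of the count (summing over m = 2, . . . , 16 gives 2^16 − 17 = 65,519; including the 16 single-measure "sets" as well would give 65,535):

```python
from math import comb

# All ways to choose at least two of the 16 single measures:
n_sets = sum(comb(16, m) for m in range(2, 17))
assert n_sets == 2**16 - 1 - 16 == 65519
```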

SLIDE 30

Combination Sets

A problem

The number of ways to choose which of the 16 single measures to combine with a given combination method:

Σ_{m=2}^{16} C(16, m) = Σ_{m=2}^{16} 16! / (m! (16 − m)!) = 65,519

  • Expert choice of measures – 5, 9, and 15 measures
  • Forward Stepwise Procedure – 7, 8a, 8b, 10 measures
  • Analysis of the Logistic Regression weights trained on all 16 measures – 12 measures
  • The best single predictors of relations: C-BDA, C-SDA, C-LSA-Tasa, D-WktWiki, D-GlossVectors, D-ExtendedLesk.

SLIDE 31

Human Judgement Datasets

term, c_i    term, c_j   judgement, s   sim, ŝ   judgement rank, r   sim rank, r̂
tiger        cat         7.35           0.85     1                   3
book         paper       7.46           0.95     2                   2
computer     keyboard    7.62           0.81     3                   1
...          ...         ...            ...      . . .               . . .
possibility  girl        1.94           0.25     64                  65
sugar        approach    0.88           0.05     65                  23

Data:

  • WordSim353 – 353 term pairs (Finkelstein et al., 2002)
  • MC – 30 term pairs (Miller and Charles, 1991)
  • RG – 65 term pairs (Rubenstein and Goodenough, 1965)

Criteria:

  • Pearson correlation: ρ = cov(s, ŝ) / (σ(s) σ(ŝ))
  • Spearman's correlation: r = cov(r, r̂) / (σ(r) σ(r̂))
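Both criteria in a few lines (a sketch: the Spearman variant ranks by double argsort and does not average tied ranks, unlike a full implementation):

```python
import numpy as np

def pearson(s, s_hat):
    """Pearson correlation: cov(s, s_hat) / (std(s) * std(s_hat))."""
    return np.corrcoef(s, s_hat)[0, 1]

def spearman(s, s_hat):
    """Spearman correlation: Pearson correlation between the ranks."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(s), rank(s_hat))[0, 1]
```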

SLIDE 32

Semantic Relation Datasets: Data

term, c_i   term, c_j    relation type, t
judge       adjudicate   syn
judge       arbitrate    syn
judge       chancellor   syn
judge       sheriff      syn
...         ...          ...
judge       pc           random
judge       fare         random
judge       lemon        random

  • BLESS (Baroni and Lenci, 2011)
  • 26,554 relations
  • hypernyms, co-hyponyms, meronyms, associations, attributes, random relations
  • SN (Semantic Neighbors)
  • 14,682 relations
  • synonyms, random relations
  • |R_random| / |R_rest| ≈ 0.5

SLIDE 33

Semantic Relation Datasets: Criteria

  • Based on the number of correctly extracted (ranked) relations.
  • R – all non-random semantic relations
  • R̂(k) – the relations extracted with k nearest neighbors

Criteria

  • Precision: P(k) = |R ∩ R̂(k)| / |R̂(k)|
  • Recall: R(k) = |R ∩ R̂(k)| / |R|
  • F1-measure: F(k) = 2 · P(k) · R(k) / (P(k) + R(k))
  • MAP: M(k) = (1/k) Σ_{i=1}^{k} P(i)
  • We use P(10), P(20), P(50), R(50), M(20), M(50).
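The four criteria over a ranked relation list (a sketch that assumes exactly k relations are extracted, i.e. |R̂(k)| = k):

```python
def eval_at_k(ranked, gold, k):
    """Precision, recall, F1, and MAP over the top-k extracted
    relations; `ranked` is a list of relations sorted by similarity,
    `gold` is the set of non-random semantic relations R."""
    hits = sum(1 for rel in ranked[:k] if rel in gold)
    p = hits / k
    r = hits / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    # MAP as defined on the slide: mean of P(i) for i = 1..k
    m = sum(sum(1 for rel in ranked[:i] if rel in gold) / i
            for i in range(1, k + 1)) / k
    return p, r, f1, m
```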

SLIDE 34

Semantic Relation Datasets: Example

  • Precision P(50%) = 6/7 ≈ 0.86

term, c_i    term, c_j     relation type   s_ij
aficionado   enthusiast    syn             0.07197
aficionado   fan           syn             0.05195
aficionado   admirer       syn             0.01964
aficionado   addict        syn             0.01326
aficionado   devotee       syn             0.01163
aficionado   foundling     random          0.00777
aficionado   fanatic       syn             0.00414
aficionado   adherent      syn             0.00353
aficionado   capital       random          0.00232
aficionado   statute       random          0.00029
aficionado   blot          random          0.00025
aficionado   meddler       random          0.00005
aficionado   enlargement   random          0.00003
aficionado   bawdyhouse    random          0.00000

SLIDE 35

Single Similarity Measures

Figure: Performance of 16 single similarity measures on human judgement datasets (MC, RG, WordSim353). The best scores in a group are in bold.

SLIDE 36

Single Similarity Measures

Figure: Performance of 16 single similarity measures on human judgement datasets (MC, RG, WordSim353) and semantic relation datasets (BLESS and SN). The best scores in a group are in bold.

SLIDE 37

Hybrid Similarity Measures

Figure: Performance of 16 single and 8 hybrid similarity measures on human judgement datasets (MC, RG, WordSim353) and semantic relation datasets (BLESS and SN). The best scores in a group (single/hybrid) are in bold; the very best scores are in grey.

SLIDE 38

Hybrid Similarity Measures

Figure: Precision-Recall graphs calculated on the BLESS dataset of (a) 16 single measures and the best hybrid measure H-Logit-E15; (b) 8 hybrid measures.

SLIDE 39

Conclusion

  • We have undertaken a study of 16 baseline measures, 8 combination methods, and 3 measure selection techniques.
  • The proposed hybrid measures:
  • use all 5 main types of baseline measures;
  • outperform the single measures on all datasets.
  • The best results were provided by:
  • a combination of 15 corpus-, web-, network-, and definition-based measures
  • with Logistic Regression
  • ρ = 0.870, P(20) = 0.987, R(50) = 0.814.

SLIDE 40

Conclusion

  • We have undertaken a study of 16 baseline measures, 8 combination methods, and 3 measure selection techniques.
  • The proposed hybrid measures:
  • use all 5 main types of baseline measures;
  • outperform the single measures on all datasets.
  • The best results were provided by:
  • a combination of 15 corpus-, web-, network-, and definition-based measures
  • with Logistic Regression
  • ρ = 0.870, P(20) = 0.987, R(50) = 0.814.
  • We are going to apply the developed methods to query expansion and text classification.

SLIDE 41

Thank you! Questions?
