
SLIDE 1

Exploring PropBanks for English and Hindi

Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder

SLIDE 2

Why is semantic information important?

Imagine an automatic question answering system
Who created the first effective polio vaccine?
Two possible choices:

– Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine
– The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

SLIDE 3

Question Answering

Who created the first effective polio vaccine?

– Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine
– The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

SLIDE 4

Question Answering

Who created the first effective polio vaccine?

– [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine
– [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh

SLIDE 5

Question Answering

Who created the first effective polio vaccine?

– [Becton Dickinson]_agent created the [first disposable syringe]_theme for use with the mass administration of the first effective polio vaccine
– [The first effective polio vaccine]_theme was created in 1952 by [Jonas Salk]_agent at the University of Pittsburgh

SLIDE 6

Question Answering

We need semantic information to prefer the right answer
The theme of create should be ‘the first effective polio vaccine’
The theme in the first sentence was ‘the first disposable syringe’
We can filter out the wrong answer
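
The filtering step above can be sketched in a few lines of Python. This is only an illustration: the `Proposition` class and `answer_who` function are hypothetical names, and the role labels are written by hand rather than produced by a semantic role labeller.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    """One predicate with its labelled arguments (hand-written, not parser output)."""
    predicate: str
    agent: str
    theme: str


# The two candidate sentences from the slides, reduced to their 'create' propositions.
candidates = [
    Proposition("create", agent="Becton Dickinson", theme="the first disposable syringe"),
    Proposition("create", agent="Jonas Salk", theme="the first effective polio vaccine"),
]


def answer_who(predicate: str, theme: str, props: list[Proposition]) -> list[str]:
    """Return the agents of propositions whose predicate and theme match the question."""
    return [
        p.agent
        for p in props
        if p.predicate == predicate and p.theme.lower() == theme.lower()
    ]


answers = answer_who("create", "the first effective polio vaccine", candidates)
print(answers)
```

Matching on the theme rather than on surface word overlap is what rules out the syringe sentence, even though it also contains the phrase "the first effective polio vaccine".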

SLIDE 7

We need semantic information

To find out about events and their participants
To capture semantic information across syntactic variation

SLIDE 8

Semantic information

Semantic information about verbs and participants is expressed through semantic roles
Agent, Experiencer, Theme, Result etc.
However, it is difficult to have a standard set of thematic roles

SLIDE 9

Proposition Bank

Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling
A PropBank is a large annotated corpus of predicate-argument information
A set of semantic roles is defined for each verb
A syntactically parsed corpus is then tagged with verb-specific semantic role information

SLIDE 10

Outline

English PropBank
Background
Annotation
Frame files & Tagset
Hindi PropBank development
Adapting Frame files
Light verbs
Mapping from dependency labels

SLIDE 11

Proposition Bank

The first (English) PropBank was created on a 1-million-word syntactically parsed Wall Street Journal corpus
PropBank annotation has also been done on different genres, e.g. web text and biomedical text
Arabic, Chinese & Hindi PropBanks have been created

SLIDE 12

English PropBank

English PropBank was envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003)
It added a layer of predicate-argument information to the Penn Treebank
Broad in its coverage: covers every instance of a verb and its semantic arguments in the corpus
Amenable to collecting representative statistics

SLIDE 13

English PropBank Annotation

Two steps are involved in annotation

– Choose a sense ID for the predicate
– Annotate the arguments of that predicate with semantic roles

This requires two components: frame files and the PropBank tagset

SLIDE 14

PropBank Frame files

PropBank defines semantic roles on a verb-by-verb basis
This is defined in a verb lexicon consisting of frame files
Each predicate will have a set of roles associated with a distinct usage
A polysemous predicate can have several rolesets within its frame file

SLIDE 15

An example

John rings the bell

ring.01: Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for

SLIDE 16

An example

John rings the bell
Tall aspen trees ring the lake

ring.01: Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for
ring.02: To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity

SLIDE 17

An example

[John] rings [the bell]
[Tall aspen trees] ring [the lake]

ring.01: Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for
ring.02: To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity

Sense IDs: ring.01 (first sentence), ring.02 (second sentence)

SLIDE 18

An example

[John]_ARG0 rings [the bell]_ARG1
[Tall aspen trees]_ARG1 ring [the lake]_ARG2

ring.01: Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for
ring.02: To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity

Sense IDs: ring.01 (first sentence), ring.02 (second sentence)
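
As a rough illustration of how a frame file constrains annotation, the two rolesets for ring can be held in a small lookup table. The dictionary layout and the `describe` helper are assumptions made for this sketch; they are not the format of actual PropBank frame files.

```python
# Rolesets copied from the slide. The dict layout is an illustrative assumption,
# not the file format used by actual PropBank frame files.
FRAMES = {
    "ring": {
        "ring.01": {
            "gloss": "make sound of bell",
            "roles": {"Arg0": "causer of ringing", "Arg1": "thing rung", "Arg2": "ring for"},
        },
        "ring.02": {
            "gloss": "to surround",
            "roles": {"Arg1": "surrounding entity", "Arg2": "surrounded entity"},
        },
    }
}


def describe(roleset_id: str, labelled_args: dict[str, str]) -> dict[str, str]:
    """Check the labels against the chosen roleset and attach role descriptions."""
    lemma = roleset_id.split(".")[0]
    roles = FRAMES[lemma][roleset_id]["roles"]
    unknown = set(labelled_args) - set(roles)
    if unknown:
        raise ValueError(f"{sorted(unknown)} not defined for {roleset_id}")
    return {text: roles[label] for label, text in labelled_args.items()}


bell = describe("ring.01", {"Arg0": "John", "Arg1": "the bell"})
lake = describe("ring.02", {"Arg1": "Tall aspen trees", "Arg2": "the lake"})
print(bell)
print(lake)
```

Choosing the sense ID first matters because the same label means different things in each roleset: Arg1 is the thing rung in ring.01 but the surrounding entity in ring.02, and ring.02 has no Arg0 at all.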

SLIDE 19

Frame files

The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea, Kingsbury, 2005)
Most frequently occurring verb: say
A small number of verbs had several framesets, e.g. go, come, take, make
Most others had only one frameset per file

SLIDE 20

PropBank annotation pane in Jubilee

SLIDE 21

English PropBank Tagset

Numbered arguments Arg0, Arg1, and so on until Arg4
Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose)
Modifiers give additional information about when, where or how the event occurred

SLIDE 22

PropBank tagset

Numbered argument   Description
Arg0                Agent, causer, experiencer
Arg1                Theme, patient
Arg2                Instrument, benefactive, attribute
Arg3                Starting point, benefactive, attribute
Arg4                Ending point

Numbered arguments correspond to the valency requirements of the verb, or to arguments that occur with high frequency with that verb

SLIDE 23

PropBank tagset

15 modifier labels for English PropBank
[He]_Arg0 studied [economic growth]_Arg1 [in India]_ArgM-LOC

Modifier   Description
ArgM-LOC   Location
ArgM-TMP   Time
ArgM-GOL   Goal
ArgM-MNR   Manner
ArgM-CAU   Cause
ArgM-ADV   Adverbial

SLIDE 24

PropBank tagset

Verb specific and more generalized
Arg0 and Arg1 correspond to Dowty’s Proto-Roles
Leverage the commonalities among semantic roles
Agents, causers, experiencers: Arg0
Undergoers, patients, themes: Arg1

SLIDE 25

PropBank tagset

While annotating Arg0 and Arg1:

– Unaccusative verbs take Arg1 as their subject argument: [The window]_Arg1 broke
– Unergative verbs take Arg0: [John]_Arg0 sang

A distinction is also made between internally caused events (blush: Arg0) and externally caused events (redden: Arg1)

SLIDE 26

PropBank tagset

How might these map to the more familiar thematic roles?
Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles

SLIDE 27

Arg0 and Arg1 are more frequent (85%) and are learnt more easily by automatic systems
Arg2 is less frequent and maps to more than one thematic role
Arg3-5 are even more infrequent

SLIDE 28

Using PropBank

As a computational resource

– Train semantic role labellers (Pradhan et al., 2005)
– Question answering systems (with FrameNet)
– Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)

For linguists, to study various phenomena related to predicate-argument structure

SLIDE 29

Outline

English PropBank
Background
Annotation
Frame files & Tagset
Hindi PropBank development
Adapting Frame files
Light verbs
Mapping from dependency labels

SLIDE 30

Developing PropBank for Hindi-Urdu

Hindi-Urdu PropBank is part of a project to develop a multi-layered and multi-representational treebank for Hindi-Urdu

– Hindi Dependency Treebank
– Hindi PropBank
– Hindi Phrase Structure Treebank

Ongoing project at CU-Boulder

SLIDE 31

Hindi-Urdu PropBank

Corpus of 400,000 words for Hindi
Smaller corpus of 150,000 words for Urdu
Hindi corpus consists of newswire text from ‘Amar Ujala’
So far:

– 220 verb frames
– ~100K words annotated

SLIDE 32

Developing Hindi PropBank

Making a PropBank resource for a new language

– Linguistic differences

Capturing relevant language-specific phenomena

– Annotation practices

Maintain similar annotation practices

– Consistency across PropBanks

SLIDE 33

Developing Hindi PropBank

PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees
Hindi PropBank is annotated on dependency trees

SLIDE 34

Dependency tree

Dependency trees represent relations that hold between constituents (chunks)
Karaka labels show the relations between the head verb and its dependents

[Dependency tree: दिये ‘gave’ with dependents राम ने ‘Raam erg’ (k1), पैसे ‘money’ (k2), औरत को ‘woman dat’ (k4)]

SLIDE 35

Hindi PropBank

There are three components to the annotation

Hindi Frame file creation
Insertion of empty categories
Semantic role labelling

SLIDE 36

Hindi PropBank

There are three components to the annotation

Hindi Frame file creation
Insertion of empty categories
Semantic role labelling
Both frame creation and labelling require new strategies for Hindi

SLIDE 37

Hindi PropBank

Hindi frame files were adapted to include

– Morphological causatives
– Unaccusative verbs
– Experiencers

Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs

SLIDE 38

Causatives

Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file.
E.g. Cornerstone’s multi-lemma mode is used for the verbs खा (KA, ‘eat’), खिला (KilA, ‘feed’) and खिलवा (KilvA, ‘cause to feed’)

SLIDE 39

Causatives

[raam ne]_Arg0 [khaana]_Arg1 khaayaa
Ram erg food eat-perf
‘Ram ate the food’

[mohan ne]_Arg0 [raam ko]_Arg2 [khaana]_Arg1 khilaayaa
Mohan erg Ram dat food eat-caus-perf
‘Mohan made Ram eat the food’

[sita ne]_ArgC [mohan se]_ArgA [raam ko]_Arg2 [khaana]_Arg1 khilvaayaa
Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf
‘Sita, through Mohan, made Ram eat the food’

Roleset id: KA.01 (to eat)
  Arg0  eater
  Arg1  the thing being eaten
Roleset id: KilA.01 (to feed)
  Arg0  feeder
  Arg2  eater
  Arg1  the thing being eaten
Roleset id: KilvA.01 (to cause to be fed)
  ArgC  causer of feeding
  ArgA  feeder
  Arg2  eater
  Arg1  the thing eaten

SLIDE 40

Unaccusatives

PropBank needs to distinguish proto-agents and proto-patients

Sudha danced
– Unergative, animate agentive arguments: Arg0

The door opened
– Unaccusative, non-animate patient arguments: Arg1

For English, these distinctions were available in VerbNet
For Hindi, various diagnostic tests are applied to make this distinction

SLIDE 41

Unaccusativity diagnostics (5)

First stage: cognate object & ergative case tests
– Yes: unergative, e.g. naac, dauRa, bEtha?
– No: go to the second stage

Second stage: applicable to verbs undergoing transitivity alternation; eliminate those that take (mostly) animate subjects
– Yes: unaccusative, e.g. khulnaa, barasnaa
– No: go to the third stage

Third stage: for non-alternating verbs and others that remain, the verb will be unaccusative if:
Impersonal passives are not possible
Use of ‘huaa’ (past participial relative) is possible
The inanimate subject appears without overt genitive
– Yes: unaccusative (eliminate bEtha)
– No: take the majority vote on the tests
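
The final "majority vote" fallback could look something like the sketch below. The slide does not say exactly which tests enter the vote or how ties are handled, so both choices here are assumptions, and the test outcomes in the example are hypothetical values, not judgments about any real Hindi verb.

```python
def majority_vote(points_to_unaccusative: dict[str, bool]) -> str:
    """Classify a verb from diagnostic outcomes.

    Each value is True when that diagnostic points to an unaccusative
    analysis. Returning 'undecided' on a tie is an assumption of this sketch.
    """
    yes = sum(points_to_unaccusative.values())
    no = len(points_to_unaccusative) - yes
    if yes == no:
        return "undecided"
    return "unaccusative" if yes > no else "unergative"


# Hypothetical outcomes for some verb X (illustration only).
hypothetical_outcomes = {
    "impersonal passive not possible": True,
    "'huaa' past participial relative possible": True,
    "inanimate subject without overt genitive": False,
}
verdict = majority_vote(hypothetical_outcomes)
print(verdict)
```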

SLIDE 42

Unaccusativity

Unaccusative verb: महक mahak ‘to smell good’
  Arg1  entity that smells good

impersonal passive test
past participial relative test

This information is then captured in the frame

SLIDE 43

Experiencers

Arg0 includes agents, causers, experiencers
In Hindi, the experiencer subjects occur with dikhnaa (to glimpse), milnaa (to find), suujhnaa (to be struck with) etc.
Typically marked with dative case

– Mujh-ko chaand dikhaa
I.dat moon glimpse.pf
(I glimpsed the moon)

PropBank labels these as Arg0, with an additional descriptor field ‘experiencer’

– Internally caused events

SLIDE 44

Complex predicates

A large number of complex predicates in Hindi

– Noun-verb complex predicates

vichaar karna; think do (to think)
dhyaan denaa; attention give (to pay attention)

– Adjectives can also occur

accha lagnaa: good seem (to like)

– the predicating element is not the verb alone, but forms a composite with the noun/adj

SLIDE 45

Complex predicates

There are also a number of verb-verb complex predicates
Combine with the bare form of the verb to convey different meanings
ro paRaa; cry lie (to burst out crying)
Add some aspectual meaning to the verb
As these occur with only a certain set of verbs, we find them automatically, based on part-of-speech tag

SLIDE 46

Complex predicates

However, the noun-verb complex predicates are more pervasive
They occur with a large number of nouns
Frame files for the nominal predicate need to be created
E.g. in a sample of 90K words, the light verb kar ‘do’ occurred 1738 times with 298 different nominals

SLIDE 47

Complex predicates

Annotation strategy
Additional resources (frame files)
Cluster the large number of nominals?

SLIDE 48

Light verb annotation

A convention for the annotation of light verbs has been adopted across Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010)
Annotation is carried out in three passes

– Manual identification of light verb
– Annotation of arguments based on frame file
– Deterministic merging of the light verb and ‘true’ predicate

In Hindi, this process may be simplified because of the dependency label ‘pof’ that identifies a light verb

SLIDE 49

राम ने पैसे चोरी कीए
Raam erg money theft do.perf
‘Ram stole the money’

1. Identify the N+V sequences that are complex predicates.
2. Annotate the predicating expression with ARG-PRX.
   REL: kar
   ARG-PRX: corii
3. Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame.
4. Automatically merge the nominal with the light verb.
   REL: corii_kiye
   Arg0: raam
   Arg1: paese
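
The deterministic merge in the last step is simple enough to sketch. The `merge_light_verb` function and the dictionary representation are assumptions for illustration; only the labels and values come from the slide.

```python
def merge_light_verb(annotation: dict[str, str], verb_form: str) -> dict[str, str]:
    """Fold the ARG-PRX nominal into REL and drop the separate ARG-PRX entry.

    `verb_form` is the surface form of the light verb, since the merged REL
    on the slide uses 'kiye' rather than the lemma 'kar'.
    """
    merged = {k: v for k, v in annotation.items() if k not in ("REL", "ARG-PRX")}
    merged["REL"] = f"{annotation['ARG-PRX']}_{verb_form}"
    return merged


# State after the arguments have been annotated against the nominal frame.
before = {"REL": "kar", "ARG-PRX": "corii", "Arg0": "raam", "Arg1": "paese"}
after = merge_light_verb(before, verb_form="kiye")
print(after)
```

Because the merge only rewrites REL and removes ARG-PRX, the argument labels assigned in the earlier pass are carried over unchanged.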

SLIDE 50

Hindi PropBank tagset

24 labels

Label   Description
ARG0    Agent, causer, experiencer
ARG1    Patient, theme, undergoer
ARG2    Beneficiary
ARG3    Instrument

SLIDE 51

Annotating Hindi PropBank

Label      Description
ARG0       Agent, causer, experiencer
ARG1       Patient, theme, undergoer
ARG2       Beneficiary
ARG3       Instrument

Function tags
ARG2-ATR   attribute
ARG2-LOC   location
ARG2-GOL   goal
ARG2-SOU   source

SLIDE 52

Annotating Hindi PropBank

Label      Description
ARG0       Agent, causer, experiencer
ARG1       Patient, theme, undergoer
ARG2       Beneficiary
ARG3       Instrument

Function tags
ARG2-ATR   attribute
ARG2-LOC   location
ARG2-GOL   goal
ARG2-SOU   source

Causative
ARGC       causer
ARGA       secondary causer

SLIDE 53

Annotating Hindi PropBank

Label      Description
ARG0       Agent, causer, experiencer
ARG1       Patient, theme, undergoer
ARG2       Beneficiary
ARG3       Instrument

Function tags
ARG2-ATR   attribute
ARG2-LOC   location
ARG2-GOL   goal
ARG2-SOU   source

Causative
ARGC       causer
ARGA       secondary causer

Complex predicate
ARGM-VLV   Verb-verb construction
ARGM-PRX   Noun-verb construction

SLIDE 54

Annotating Hindi PropBank

Other modifier labels

Label      Description
ARGM-ADV   adverb
ARGM-CAU   cause
ARGM-DIR   direction
ARGM-DIS   discourse
ARGM-EXT   extent
ARGM-LOC   location
ARGM-MNR   manner
ARGM-MNS   means
ARGM-MOD   modal
ARGM-NEG   negation
ARGM-PRP   purpose
ARGM-TMP   time

SLIDE 55

Hindi PropBank annotation

Annotation pane Frameset file display

SLIDE 56

Annotation practice

Need to maintain good annotation practices
Current practice: double-blind annotation, followed by adjudication
Inter-annotator agreement measures the consistency in the annotation task
English PropBank had high inter-annotator agreement (K=0.91)
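
For readers unfamiliar with agreement scores, the sketch below computes Cohen's kappa for two annotators. The slide does not say which statistic K refers to, so treating it as Cohen's kappa is an assumption, and the label sequences are toy data, not PropBank annotations.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: the two annotators disagree on one of four arguments.
annotator_a = ["Arg0", "Arg1", "Arg1", "Arg2"]
annotator_b = ["Arg0", "Arg1", "Arg2", "Arg2"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Raw agreement here is 0.75, but kappa is lower because some of that agreement would be expected by chance.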

SLIDE 57

Hindi PropBank annotation

Improve consistency & annotation speed
PropBank annotation on dependency trees has some advantages
Hindi Treebank uses a large set of dependency labels that have rich semantic information

[Dependency tree: दिये ‘gave’ with dependents राम ने ‘Raam erg’ (k1, Arg0), पैसे ‘money’ (k2, Arg1), औरत को ‘woman dat’ (k4, Arg2)]

SLIDE 58

Deriving PropBank labels from dependencies

We can derive Hindi PropBank labels from dependency labels
Mapping will reduce annotation effort and improve speed
The dependency tagset has labels that are in some ways fairly similar to PropBank

– Verb-specific labels: k1-5
– Verb modifier labels: k7p, k7t, rh etc.

SLIDE 59

Label comparison

Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB

SLIDE 60

Label comparison

Similarly, linguistic intuition gives us the mapping from HDT for HPB modifiers

SLIDE 61

Label comparison

These mappings are included in the PB frame files, for example for the verb ‘A: to come’
Only for numbered arguments
Basis for the linguistically motivated rules

Roleset   Usage            Rule
A.01      To come (path)   k1 → Arg1, k2p → Arg2-GOL
A.03      To arrive        k1 → Arg0, k2p → Arg2-GOL, k5 → Arg2-SOU
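
A frame-file mapping rule of this kind amounts to a lookup keyed on roleset and dependency label. The sketch below encodes the two rolesets from the table; the `MAPPING_RULES` layout and `map_label` helper are hypothetical, not the frame-file format itself.

```python
from typing import Optional

# Mapping rules for the verb 'A' (to come), copied from the table on the slide.
MAPPING_RULES = {
    "A.01": {"k1": "Arg1", "k2p": "Arg2-GOL"},
    "A.03": {"k1": "Arg0", "k2p": "Arg2-GOL", "k5": "Arg2-SOU"},
}


def map_label(roleset_id: str, dep_label: str) -> Optional[str]:
    """Return the PropBank label for a dependency label, or None if no rule applies."""
    return MAPPING_RULES.get(roleset_id, {}).get(dep_label)


print(map_label("A.01", "k1"))
print(map_label("A.03", "k1"))
```

The same dependency label k1 maps to Arg1 under A.01 but to Arg0 under A.03, which is why the rules are stored per roleset rather than as one global table.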


SLIDE 62

Automatic mapping of DT to PB

A rule-based, probabilistic system for automatic mapping
We use two kinds of resources:

– Annotated corpus [Treebank + PropBank]: 32,300 tokens, 2005 predicates
– Frame files with mapping rules

SLIDE 63

Argument classification

We use three kinds of rules to carry out automatic mapping

– Deterministic rules: dependency label mapped directly onto PropBank
– Empirically derived rules: using corpus statistics associated with dependency & PropBank labels
– Linguistically motivated rules: derived from linguistic intuition & captured in frame files

SLIDE 64

Example of the rules

Features                  Count   PropBank labels
xe.01_active_k1 (give)    32      Arg0: 0.93, Arg1: 0.03, Arg2: 0.03
xe.01_active_k2           65      Arg1: 0.95, Arg2: 0.01, Arg0: 0.01
xe.01_active_k4           34      Arg2: 0.94, Arg0: 0.02

Associate the probability of each label with a particular feature tuple
We use only 3 features: roleset ID, voice, dependency label
For the verb give, we get the correct mapping to the Hindi labels
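
One way such probabilities could be estimated is by relative frequency over the annotated corpus, as in the sketch below. This is an assumption about the method, not a description of the system actually used; the function names are hypothetical and the counts are toy values, not the corpus counts reported on the slide.

```python
from collections import Counter, defaultdict

# A feature tuple is (roleset ID, voice, dependency label).
Feature = tuple[str, str, str]


def estimate_rules(observations: list[tuple[Feature, str]]) -> dict[Feature, dict[str, float]]:
    """Estimate P(PropBank label | feature tuple) by relative frequency."""
    counts: dict[Feature, Counter] = defaultdict(Counter)
    for feature, label in observations:
        counts[feature][label] += 1
    return {
        feature: {label: c / sum(label_counts.values()) for label, c in label_counts.items()}
        for feature, label_counts in counts.items()
    }


def best_label(rules: dict[Feature, dict[str, float]], feature: Feature):
    """Pick the most probable label, or None for an unseen feature tuple."""
    dist = rules.get(feature)
    return max(dist, key=dist.get) if dist else None


# Toy counts (hypothetical): k1 of active 'xe.01' is Arg0 nine times out of ten.
feature = ("xe.01", "active", "k1")
toy_observations = [(feature, "Arg0")] * 9 + [(feature, "Arg1")]
rules = estimate_rules(toy_observations)
print(rules[feature])
print(best_label(rules, feature))
```

Returning `None` for unseen feature tuples mirrors the low recall one would expect from purely empirical rules on a small corpus: a tuple never observed in training yields no prediction.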

SLIDE 65

Evaluation

32,300 tokens of annotated data
Ten-fold cross validation for evaluation
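
A minimal sketch of a ten-fold split is shown below. How the data was actually partitioned (by sentence, predicate, or document) is not stated on the slide; splitting the 2005 predicates round-robin is an assumption, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_items: int, k: int = 10):
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Assumption: one item per annotated predicate (2005 in the corpus).
folds = list(k_fold_indices(2005, k=10))
for train, test in folds[:2]:
    print(len(train), len(test))
```

Each item is held out exactly once, so every reported score is an average over predictions made on data the rules were not derived from.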


SLIDE 66

Results

                                 Precision   Recall   F1 score
Empirically derived rules        90.59       47.92    62.69
Linguistically motivated rules   89.80       55.28    68.44

SLIDE 67

Results

                                 Precision   Recall   F1 score
Empirically derived rules        90.59       47.92    62.69
Linguistically motivated rules   89.80       55.28    68.44

Numbered argument accuracy

                                 Precision   Recall   F-score
Empirically derived rules        93.63       58.76    72.21
Linguistically motivated rules   91.87       72.36    80.96
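
The F-scores can be sanity-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes them from the reported figures; small differences in the second decimal place are to be expected if the reported scores were averaged over folds, which is an assumption here.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as given on the slides.
reported = {
    "empirical, all arguments": (90.59, 47.92, 62.69),
    "linguistic, all arguments": (89.80, 55.28, 68.44),
    "empirical, numbered arguments": (93.63, 58.76, 72.21),
    "linguistic, numbered arguments": (91.87, 72.36, 80.96),
}

for name, (p, r, reported_f) in reported.items():
    print(f"{name}: recomputed {f1(p, r):.2f}, reported {reported_f:.2f}")
```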


SLIDE 68

Evaluation

Linguistically motivated rules improve the recall with a slight drop in the precision
With the most frequent PropBank labels, empirically derived rules perform well
More data should improve the performance for modifier arguments

SLIDE 69

Evaluation

Annotation practices also affect the mapping

– PB labels are coarse-grained: e.g. ArgM-ADV maps to four different dependency labels
– PB labels are fine-grained: e.g. ‘means’ and ‘causes’ are distinguished (ArgM-MNS, ArgM-CAU) but are lumped together under a single label in the dependency treebank

SLIDE 70

Evaluation

Our goal is to find a useful set of mappings to use at a pre-annotation stage
We expect that with more PropBanked data, empirically derived rules will perform better
Using additional resources such as Hindi WordNet can help to make up for the lack of training data

SLIDE 71

Summary

Semantic role labels
English PropBank

– Frame files
– PropBank tagset

Development of Hindi PropBank

– Linguistic issues
– Experiments

SLIDE 72