Exploring PropBanks for English and Hindi
Ashwini Vaidya
Dept of Linguistics, University of Colorado, Boulder
Why is semantic information important?
- Imagine an automatic question answering system
- Who created the first effective polio vaccine?
- Two possible choices:
– Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine
– The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh
Question Answering
- Who created the first effective polio vaccine?
– [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine
– [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh
Question Answering
- Who created the first effective polio vaccine?
– [Becton Dickinson]_agent created the [first disposable syringe]_theme for use with the mass administration of the first effective polio vaccine
– [The first effective polio vaccine]_theme was created in 1952 by [Jonas Salk]_agent at the University of Pittsburgh
Question Answering
- We need semantic information to prefer the right answer
- The theme of create should be ‘the first effective polio vaccine’
- The theme in the first sentence was ‘the first disposable syringe’
- We can filter out the wrong answer
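The filtering step can be sketched as a small program. The pre-labelled candidates, role names and helper function below are hypothetical, standing in for the output of a semantic role labeller:

```python
# Hypothetical sketch: filter QA candidates by matching the theme of "create".
# The role-labelled candidates are assumed to come from a semantic role labeller.

candidates = [
    {"agent": "Becton Dickinson", "theme": "the first disposable syringe"},
    {"agent": "Jonas Salk", "theme": "the first effective polio vaccine"},
]

def who_created(candidates, question_theme):
    """Keep only agents whose 'create' event has the theme asked about."""
    return [c["agent"] for c in candidates
            if c["theme"].lower() == question_theme.lower()]

print(who_created(candidates, "the first effective polio vaccine"))
# → ['Jonas Salk']
```

The first candidate is rejected because its theme is the syringe, not the vaccine, even though the question words all appear in that sentence.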
We need semantic information
- To find out about events and their participants
- To capture semantic information across syntactic variation
Semantic information
- Semantic information about verbs and their participants is expressed through semantic roles
- Agent, Experiencer, Theme, Result etc.
- However, it is difficult to settle on a standard set of thematic roles
Proposition Bank
- Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling
- A PropBank is a large annotated corpus of predicate-argument information
- A set of semantic roles is defined for each verb
- A syntactically parsed corpus is then tagged with verb-specific semantic role information
Outline
- English PropBank
- Background
- Annotation
- Frame files & Tagset
- Hindi PropBank development
- Adapting Frame files
- Light verbs
- Mapping from dependency labels
Proposition Bank
- The first (English) PropBank was created on a 1-million-word syntactically parsed Wall Street Journal corpus
- PropBank annotation has also been done on different genres, e.g. web text, biomedical text
- Arabic, Chinese & Hindi PropBanks have been created
English PropBank
- English PropBank was envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003)
- Added a layer of predicate-argument information to the Penn Treebank
- Broad in its coverage: covers every instance of a verb and its semantic arguments in the corpus
- Amenable to collecting representative statistics
English PropBank Annotation
- Two steps are involved in annotation:
  – Choose a sense ID for the predicate
  – Annotate the arguments of that predicate with semantic roles
- This requires two components: frame files and the PropBank tagset
PropBank Frame files
- PropBank defines semantic roles on a verb-by-verb basis
- These are defined in a verb lexicon consisting of frame files
- Each predicate has a set of roles associated with a distinct usage
- A polysemous predicate can have several rolesets within its frame file
An example
- John rings the bell
ring.01: make sound of bell
  Arg0: causer of ringing
  Arg1: thing rung
  Arg2: ring for
An example
- [John]_ARG0 rings [the bell]_ARG1 (ring.01)
- [Tall aspen trees]_ARG1 ring [the lake]_ARG2 (ring.02)

ring.01: make sound of bell
  Arg0: causer of ringing
  Arg1: thing rung
  Arg2: ring for
ring.02: to surround
  Arg1: surrounding entity
  Arg2: surrounded entity
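A frame file can be thought of as a small lookup table. A minimal sketch as a Python dict follows; the role descriptions are from the slide, but the data structure itself is only illustrative, not PropBank's actual file format:

```python
# Illustrative sketch of the "ring" frame file as a dict of rolesets.
ring_frames = {
    "ring.01": {  # make sound of bell
        "Arg0": "causer of ringing",
        "Arg1": "thing rung",
        "Arg2": "ring for",
    },
    "ring.02": {  # to surround
        "Arg1": "surrounding entity",
        "Arg2": "surrounded entity",
    },
}

# Annotation step 1: choose the sense ID; step 2: label arguments with its roles.
sense = "ring.02"                      # "Tall aspen trees ring the lake"
print(ring_frames[sense]["Arg1"])      # role of "tall aspen trees"
# → surrounding entity
```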
Frame files
- The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea & Kingsbury, 2005)
- Most frequently occurring verb: say
- A small number of verbs had several framesets, e.g. go, come, take, make
- Most others had only one frameset per file
PropBank annotation pane in Jubilee
English PropBank Tagset
- Numbered arguments Arg0, Arg1, and so on up to Arg4
- Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose)
- Modifiers give additional information about when, where or how the event occurred
PropBank tagset
Numbered argument: Description
Arg0: agent, causer, experiencer
Arg1: theme, patient
Arg2: instrument, benefactive, attribute
Arg3: starting point, benefactive, attribute
Arg4: ending point

- These correspond to the valency requirements of the verb
- Or to arguments that occur with high frequency with that verb
PropBank tagset
- 15 modifier labels for English PropBank
- [He]_Arg0 studied [economic growth]_Arg1 [in India]_ArgM-LOC

Modifier: Description
ArgM-LOC: location
ArgM-TMP: time
ArgM-GOL: goal
ArgM-MNR: manner
ArgM-CAU: cause
ArgM-ADV: adverbial
PropBank tagset
- Labels are verb-specific as well as more generalized
- Arg0 and Arg1 correspond to Dowty’s proto-roles
- They leverage the commonalities among semantic roles
- Agents, causers, experiencers: Arg0
- Undergoers, patients, themes: Arg1
PropBank tagset
- While annotating Arg0 and Arg1:
  – Unaccusative verbs take Arg1 as their subject argument
    - [The window]_Arg1 broke
  – Unergatives take Arg0
    - [John]_Arg0 sang
- A distinction is also made between internally caused events (blush: Arg0) & externally caused events (redden: Arg1)
PropBank tagset
- How might these map to the more familiar thematic roles?
- Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles
- The more frequent Arg0 and Arg1 (85%) are learnt more easily by automatic systems
- Arg2 is less frequent and maps to more than one thematic role
- Arg3-5 are even more infrequent
Using PropBank
- As a computational resource:
  – Train semantic role labellers (Pradhan et al., 2005)
  – Question answering systems (with FrameNet)
  – Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)
- For linguists, to study various phenomena related to predicate-argument structure
Outline
- English PropBank
- Background
- Annotation
- Frame files & Tagset
- Hindi PropBank development
- Adapting Frame files
- Light verbs
- Mapping from dependency labels
Developing PropBank for Hindi-Urdu
- Hindi-Urdu PropBank is part of a project to develop a multi-layered and multi-representational treebank for Hindi-Urdu:
  – Hindi Dependency Treebank
  – Hindi PropBank
  – Hindi Phrase Structure Treebank
- Ongoing project at CU-Boulder
Hindi-Urdu PropBank
- Corpus of 400,000 words for Hindi
- Smaller corpus of 150,000 words for Urdu
- The Hindi corpus consists of newswire text from ‘Amar Ujala’
- So far:
  – 220 verb frames
  – ~100K words annotated
Developing Hindi PropBank
- Making a PropBank resource for a new language involves:
  – Linguistic differences: capturing relevant language-specific phenomena
  – Annotation practices: maintaining similar practices for consistency across PropBanks
Developing Hindi PropBank
- PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees
- Hindi PropBank is annotated on dependency trees
Dependency tree
- Dependency trees represent relations that hold between constituents (chunks)
- Karaka labels show the relations between the head verb and its dependents

[Dependency tree: दिये ‘gave’ with dependents राम ने ‘Raam-erg’ (k1), पैसे ‘money’ (k2), औरत को ‘woman-dat’ (k4)]
Hindi PropBank
- There are three components to the annotation:
  - Hindi frame file creation
  - Insertion of empty categories
  - Semantic role labelling
- Both frame creation and labelling require new strategies for Hindi
Hindi PropBank
- Hindi frame files were adapted to include:
  – Morphological causatives
  – Unaccusative verbs
  – Experiencers
- Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs
Causatives
- Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file
- E.g. Cornerstone’s multi-lemma mode is used for the verbs खा (KA ‘eat’), खिला (KilA ‘feed’) and खिलवा (KilvA ‘cause to feed’)
Causatives
raam ne_Arg0 khaana_Arg1 khaayaa
Ram-erg food eat-perf
‘Ram ate the food’

mohan ne_Arg0 raam ko_Arg2 khaana_Arg1 khilaayaa
Mohan-erg Ram-dat food eat-caus-perf
‘Mohan made Ram eat the food’

sita ne_ArgC mohan se_ArgA raam ko_Arg2 khaana_Arg1 khilvaayaa
Sita-erg Mohan-instr Ram-acc food eat-ind.caus-caus-perf
‘Sita, through Mohan, made Ram eat the food’

Roleset KA.01 ‘to eat’: Arg0 eater; Arg1 the thing being eaten
Roleset KilA.01 ‘to feed’: Arg0 feeder; Arg2 eater; Arg1 the thing being eaten
Roleset KilvA.01 ‘to cause to be fed’: ArgC causer of feeding; ArgA feeder; Arg2 eater; Arg1 the thing eaten
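The three related predicates above can be sketched as rolesets grouped in one frame file. The dict layout is only an illustration of the multi-lemma grouping idea, not Cornerstone's actual file format:

```python
# Sketch: one frame file grouping KA 'eat' with its causatives (roles from the slide).
ka_frame = {
    "KA.01":    {"gloss": "to eat",
                 "roles": {"Arg0": "eater", "Arg1": "thing eaten"}},
    "KilA.01":  {"gloss": "to feed",
                 "roles": {"Arg0": "feeder", "Arg2": "eater",
                           "Arg1": "thing eaten"}},
    "KilvA.01": {"gloss": "to cause to be fed",
                 "roles": {"ArgC": "causer of feeding", "ArgA": "feeder",
                           "Arg2": "eater", "Arg1": "thing eaten"}},
}

# The indirect causative introduces ArgC/ArgA on top of the feeding roleset.
print(sorted(ka_frame["KilvA.01"]["roles"]))
# → ['Arg1', 'Arg2', 'ArgA', 'ArgC']
```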
Unaccusatives
- PropBank needs to distinguish proto-agents and proto-patients
- ‘Sudha danced’
  – Unergative; animate agentive argument: Arg0
- ‘The door opened’
  – Unaccusative; non-animate patient argument: Arg1
- For English, these distinctions were available in VerbNet
- For Hindi, various diagnostic tests are applied to make this distinction
Unaccusativity diagnostics (staged)
- First stage: cognate object & ergative case tests
  – Yes: ergative, e.g. naac, dauRa, bEtha?; No: go to the second stage
- Second stage: applicable to verbs undergoing the transitivity alternation; eliminate those that take (mostly) animate subjects
  – Yes: unaccusative, e.g. khulnaa, barasnaa (bEtha is eliminated); No: go to the third stage
- Third stage: for non-alternating verbs and others that remain, take a majority vote on the tests; the verb will be unaccusative if:
  - Impersonal passives are not possible
  - Use of ‘huaa’ (past participial relative) is possible
  - The inanimate subject appears without an overt genitive
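The third-stage majority vote might be sketched like this. The test names and the diagnostic values for any particular verb are placeholders, not annotated data:

```python
# Sketch of the third-stage majority vote over the three diagnostics.
TESTS = ("impersonal_passive_impossible",
         "huaa_participle_possible",
         "no_overt_genitive_on_subject")

def is_unaccusative(diagnostics):
    """Majority vote: unaccusative if at least two diagnostics hold."""
    return sum(diagnostics[t] for t in TESTS) >= 2

# Placeholder diagnostic values for an illustrative verb:
verb = {"impersonal_passive_impossible": True,
        "huaa_participle_possible": True,
        "no_overt_genitive_on_subject": False}
print(is_unaccusative(verb))
# → True
```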
Unaccusativity diagnostics (5)
- Unaccusative verb: महक mahak ‘to smell good’
  – Arg1: entity that smells good
- Impersonal passive test
- Past participial relative test
Unaccusativity
- This information is then captured in the frame file
Experiencers
- Arg0 includes agents, causers, experiencers
- In Hindi, experiencer subjects occur with dikhnaa (‘to glimpse’), milnaa (‘to find’), suujhnaa (‘to be struck with’) etc.
- Typically marked with dative case
  – Mujh-ko chaand dikhaa
     I-dat moon glimpse-perf
     ‘I glimpsed the moon’
- PropBank labels these as Arg0, with an additional descriptor field ‘experiencer’
  – Internally caused events
Complex predicates
- There are a large number of complex predicates in Hindi
  – Noun-verb complex predicates
    - vichaar karnaa: think do (‘to think’)
    - dhyaan denaa: attention give (‘to pay attention’)
  – Adjectives can also occur
    - accha lagnaa: good seem (‘to like’)
  – The predicating element is not the verb alone, but forms a composite with the noun/adjective
Complex predicates
- There are also a number of verb-verb complex predicates
- These combine with the bare form of the verb to convey different meanings
  - ro paRaa: cry lie (‘burst out crying’)
- They add some aspectual meaning to the verb
- As these occur with only a certain set of verbs, we find them automatically, based on part-of-speech tags
Complex predicates
- However, the noun-verb complex predicates are more pervasive
- They occur with a large number of nouns
- Frame files for the nominal predicates need to be created
- E.g. in a sample of 90K words, the light verb kar (‘do’) occurred 1738 times with 298 different nominals
Complex predicates
- Annotation strategy
- Additional resources (frame files)
- Cluster the large number of nominals?
Light verb annotation
- A convention for the annotation of light verbs has been adopted across the Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010)
- Annotation is carried out in three passes:
  – Manual identification of the light verb
  – Annotation of arguments based on the frame file
  – Deterministic merging of the light verb and ‘true’ predicate
- In Hindi, this process may be simplified because of the dependency label ‘pof’ that identifies a light verb
राम ने पैसे चोरी कीए
Raam-erg money theft do-perf
‘Ram stole the money’
- 1. Identify the N+V sequences that are complex predicates
- 2. Annotate the predicating expression with ARG-PRX:
     REL: kar, ARG-PRX: corii
- 3. Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame
- 4. Automatically merge the nominal with the light verb:
     REL: corii_kiye, Arg0: raam, Arg1: paese
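Step 4, the deterministic merge, can be sketched as follows. The record layout and function name are hypothetical; only the REL/ARG-PRX field names come from the slides:

```python
# Sketch of deterministically merging the ARG-PRX nominal into the light verb.
def merge_light_verb(annotation):
    """Fuse the nominal (ARG-PRX) with the light verb's REL field."""
    merged = dict(annotation)           # leave the input record untouched
    nominal = merged.pop("ARG-PRX")
    merged["REL"] = nominal + "_" + merged["REL"]
    return merged

ann = {"REL": "kiye", "ARG-PRX": "corii", "Arg0": "raam", "Arg1": "paese"}
print(merge_light_verb(ann))
# → {'REL': 'corii_kiye', 'Arg0': 'raam', 'Arg1': 'paese'}
```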
Hindi PropBank tagset
- 24 labels

Label: Description
ARG0: agent, causer, experiencer
ARG1: patient, theme, undergoer
ARG2: beneficiary
ARG3: instrument

Function tags:
ARG2-ATR: attribute
ARG2-LOC: location
ARG2-GOL: goal
ARG2-SOU: source

Causative:
ARGC: causer
ARGA: secondary causer

Complex predicate:
ARGM-VLV: verb-verb construction
ARGM-PRX: noun-verb construction
Annotating Hindi PropBank
- Other modifier labels:

ARGM-ADV: adverb
ARGM-CAU: cause
ARGM-DIR: direction
ARGM-DIS: discourse
ARGM-EXT: extent
ARGM-LOC: location
ARGM-MNR: manner
ARGM-MNS: means
ARGM-MOD: modal
ARGM-NEG: negation
ARGM-PRP: purpose
ARGM-TMP: time
Hindi PropBank annotation
[Screenshots: annotation pane and frameset file display]
Annotation practice
- Need to maintain good annotation practices
- Current practice: double-blind annotation, followed by adjudication
- Inter-annotator agreement measures the consistency of the annotation task
- English PropBank had high inter-annotator agreement: K = 0.91
Hindi PropBank annotation
- Improve consistency & annotation speed
- PropBank annotation on dependency trees has some advantages
- The Hindi Treebank uses a large set of dependency labels that carry rich semantic information

[Dependency tree with mapped labels: दिये ‘gave’ with राम ने ‘Raam-erg’ (k1/Arg0), पैसे ‘money’ (k2/Arg1), औरत को ‘woman-dat’ (k4/Arg2)]
Deriving PropBank labels from dependencies
- We can derive Hindi PropBank labels from dependency labels
- The mapping will reduce annotation effort and improve speed
- The dependency tagset has labels that are in some ways fairly similar to PropBank’s:
  – Verb-specific labels k1-k5
  – Verb modifier labels k7p, k7t, rh etc.
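A deterministic mapping of this kind amounts to a lookup table. In the sketch below, the k7p (place) and k7t (time) pairings follow the two tagsets' descriptions, while rh → ArgM-CAU is an assumed pairing for illustration:

```python
# Sketch of a deterministic dependency-to-PropBank mapping table.
# k7p -> ArgM-LOC and k7t -> ArgM-TMP follow the tagset descriptions;
# rh -> ArgM-CAU is an assumed pairing for illustration.
DETERMINISTIC = {
    "k7p": "ArgM-LOC",   # place
    "k7t": "ArgM-TMP",   # time
    "rh":  "ArgM-CAU",   # reason (assumed)
}

def map_deterministic(dep_label):
    """Return the PropBank label, or None if no deterministic rule applies."""
    return DETERMINISTIC.get(dep_label)

print(map_deterministic("k7t"))
# → ArgM-TMP
```

Labels such as k1 get no deterministic answer here; those are the cases the empirical and linguistically motivated rules below are for.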
Label comparison
- Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB
Label comparison
- Similarly, linguistic intuition gives us the mapping from HDT labels for HPB modifiers
Label comparison
- These mappings are included in the PB frame files, for example for the verb ‘A: to come’
- Only for numbered arguments
- They form the basis for the linguistically motivated rules

Roleset A.01 ‘to come (path)’: k1 → Arg1; k2p → Arg2-GOL
Roleset A.03 ‘to arrive’: k1 → Arg0; k2p → Arg2-GOL; k5 → Arg2-SOU
Automatic mapping of DT to PB
- A rule-based, probabilistic system for automatic mapping
- We use two kinds of resources:
  – Annotated corpus [Treebank + PropBank]: 32,300 tokens, 2005 predicates
  – Frame files with mapping rules
Argument classification
- We use three kinds of rules to carry out automatic mapping:
  – Deterministic rules
    - Dependency label mapped directly onto PropBank
  – Empirically derived rules
    - Using corpus statistics associated with dependency & PropBank labels
  – Linguistically motivated rules
    - Derived from linguistic intuition & captured in frame files
Example of the rules
Feature tuple (roleset_voice_deplabel): count → label probabilities
xe.01_active_k1 (‘give’): 32 → Arg0: 0.93, Arg1: 0.03, Arg2: 0.03
xe.01_active_k2: 65 → Arg1: 0.95, Arg2: 0.01, Arg0: 0.01
xe.01_active_k4: 34 → Arg2: 0.94, Arg0: 0.02

- We associate the probability of each label with a particular feature tuple
- We use only 3 features: roleset ID, voice, dependency label
- For the verb ‘give’, we get the correct mapping to the Hindi labels
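An empirically derived rule of this shape can be sketched as counting label occurrences per feature tuple and taking the argmax. The counts below are made-up numbers shaped like the ‘give’ example, not the actual corpus statistics:

```python
# Sketch of an empirically derived rule: estimate P(label | feature tuple) from
# annotated counts and pick the most probable label. Counts are illustrative.
from collections import Counter

# (roleset, voice, dependency label) -> counts of PropBank labels
counts = {
    ("xe.01", "active", "k1"): Counter({"Arg0": 30, "Arg1": 1, "Arg2": 1}),
    ("xe.01", "active", "k2"): Counter({"Arg1": 63, "Arg2": 1, "Arg0": 1}),
    ("xe.01", "active", "k4"): Counter({"Arg2": 33, "Arg0": 1}),
}

def map_empirical(feature_tuple):
    """Most probable PropBank label for (roleset, voice, dependency label)."""
    dist = counts[feature_tuple]
    label, n = dist.most_common(1)[0]
    return label, n / sum(dist.values())

label, prob = map_empirical(("xe.01", "active", "k1"))
print(label)
# → Arg0
```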
Evaluation
- 32,300 tokens of annotated data
- Ten-fold cross validation for evaluation
Results
                                Precision  Recall  F1 score
Empirically derived rules         90.59    47.92    62.69
Linguistically motivated rules    89.80    55.28    68.44
Results
Numbered argument accuracy:

                                Precision  Recall  F-score
Empirically derived rules         93.63    58.76    72.21
Linguistically motivated rules    91.87    72.36    80.96
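The F-scores in these tables are harmonic means of precision and recall, which is easy to recompute (tiny differences can arise because the reported precision and recall are themselves rounded):

```python
# F1 as the harmonic mean of precision and recall, checked against the tables.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Numbered-argument accuracy, linguistically motivated rules:
print(round(f1(91.87, 72.36), 2))
# → 80.96
```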
Evaluation
- Linguistically motivated rules improve the recall with a slight drop in precision
- With the most frequent PropBank labels, empirically derived rules perform well
- More data should improve the performance for modifier arguments
Evaluation
- Annotation practices also affect the mapping:
  – PB labels can be coarse-grained: e.g. ArgM-ADV maps to four different dependency labels
  – PB labels can also be fine-grained: e.g. ‘means’ and ‘cause’ are distinguished (ArgM-MNS, ArgM-CAU) but are lumped together under a single label in the dependency treebank
Evaluation
- Our goal is to find a useful set of mappings to use at a pre-annotation stage
- We expect that with more PropBanked data, empirically derived rules will perform better
- Additional resources such as Hindi WordNet can help make up for the lack of training data
Summary
- Semantic role labels
- English PropBank
– Frame files
– PropBank tagset
- Development of Hindi PropBank