Exploring PropBanks for English and Hindi Ashwini Vaidya Dept of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Exploring PropBanks for English and Hindi

Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder

slide-2
SLIDE 2

Why is semantic information important?

  • Imagine an automatic question answering system
  • Who created the first effective polio vaccine?
  • Two possible choices:

– Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine
– The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

slide-3
SLIDE 3

Question Answering

  • Who created the first effective polio vaccine?

– Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine
– The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

slide-4
SLIDE 4

Question Answering

  • Who created the first effective polio vaccine?

– [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine
– [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh

slide-5
SLIDE 5

Question Answering

  • Who created the first effective polio vaccine?

– [agent Becton Dickinson] created the [theme first disposable syringe] for use with the mass administration of the first effective polio vaccine
– [theme The first effective polio vaccine] was created in 1952 by [agent Jonas Salk] at the University of Pittsburgh

slide-6
SLIDE 6

Question Answering

  • We need semantic information to prefer the right answer
  • The theme of create should be ‘the first effective polio vaccine’
  • The theme in the first sentence was ‘the first disposable syringe’
  • We can filter out the wrong answer
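
The filtering step can be sketched in a few lines, assuming each candidate sentence has already been labelled with its predicate, agent and theme. The `Proposition` class and the `answer_who` helper are hypothetical names used only for illustration; they are not part of any PropBank tool.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    """One labelled predicate with two of its semantic roles (illustrative only)."""
    predicate: str
    agent: str
    theme: str


# The two candidate sentences from the slides, already role-labelled by hand.
candidates = [
    Proposition("create", "Becton Dickinson", "the first disposable syringe"),
    Proposition("create", "Jonas Salk", "the first effective polio vaccine"),
]


def answer_who(predicate, theme, props):
    """Return the agents of propositions whose predicate and theme match the question."""
    wanted = theme.lower().strip()
    return [p.agent for p in props
            if p.predicate == predicate and p.theme.lower().strip() == wanted]


# "Who created the first effective polio vaccine?"
print(answer_who("create", "the first effective polio vaccine", candidates))
```

A plain keyword match would accept both sentences, since both contain "created" and "the first effective polio vaccine"; matching on the theme role is what rejects the syringe sentence.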
slide-7
SLIDE 7

We need semantic information

  • To find out about events and their participants
  • To capture semantic information across syntactic variation

slide-8
SLIDE 8

Semantic information

  • Semantic information about verbs and participants expressed through semantic roles
  • Agent, Experiencer, Theme, Result etc.
  • However, difficult to have a standard set of thematic roles

slide-9
SLIDE 9

Proposition Bank

  • Proposition Bank (PropBank) provides a way to carry out general purpose semantic role labelling
  • A PropBank is a large annotated corpus of predicate-argument information
  • A set of semantic roles is defined for each verb
  • A syntactically parsed corpus is then tagged with verb-specific semantic role information

slide-10
SLIDE 10

Outline

  • English PropBank
  • Background
  • Annotation
  • Frame files & Tagset
  • Hindi PropBank development
  • Adapting Frame files
  • Light verbs
  • Mapping from dependency labels
slide-11
SLIDE 11

Proposition Bank

  • The first (English) PropBank was created on a 1-million-word, syntactically parsed Wall Street Journal corpus
  • PropBank annotation has also been done on different genres, e.g. web text, biomedical text
  • Arabic, Chinese & Hindi PropBanks have been created

slide-12
SLIDE 12

English PropBank

  • English PropBank envisioned as the next level of Penn Treebank (Kingsbury & Palmer, 2003)
  • Added a layer of predicate-argument information to the Penn Treebank
  • Broad in its coverage: covering every instance of a verb and its semantic arguments in the corpus
  • Amenable to collecting representative statistics

slide-13
SLIDE 13

English PropBank Annotation

  • Two steps are involved in annotation

– Choose a sense ID for the predicate
– Annotate the arguments of that predicate with semantic roles

  • This requires two components: frame files and the PropBank tagset

slide-14
SLIDE 14

PropBank Frame files

  • PropBank defines semantic roles on a verb-by-verb basis
  • This is defined in a verb lexicon consisting of frame files
  • Each predicate will have a set of roles associated with a distinct usage
  • A polysemous predicate can have several rolesets within its frame file

slide-15
SLIDE 15

An example

  • John rings the bell

ring.01: Make sound of bell
    Arg0  Causer of ringing
    Arg1  Thing rung
    Arg2  Ring for

slide-16
SLIDE 16

An example

  • John rings the bell
  • Tall aspen trees ring the lake

ring.01: Make sound of bell
    Arg0  Causer of ringing
    Arg1  Thing rung
    Arg2  Ring for

ring.02: To surround
    Arg1  Surrounding entity
    Arg2  Surrounded entity

slide-17
SLIDE 17

An example

  • [John] rings [the bell] (ring.01)
  • [Tall aspen trees] ring [the lake] (ring.02)

ring.01: Make sound of bell
    Arg0  Causer of ringing
    Arg1  Thing rung
    Arg2  Ring for

ring.02: To surround
    Arg1  Surrounding entity
    Arg2  Surrounded entity

slide-18
SLIDE 18

An example

  • [ARG0 John] rings [ARG1 the bell] (ring.01)
  • [ARG1 Tall aspen trees] ring [ARG2 the lake] (ring.02)

ring.01: Make sound of bell
    Arg0  Causer of ringing
    Arg1  Thing rung
    Arg2  Ring for

ring.02: To surround
    Arg1  Surrounding entity
    Arg2  Surrounded entity
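
A frame file can be thought of as a lookup from roleset IDs to role descriptions. The sketch below encodes the two "ring" rolesets shown above as a plain Python dictionary. Real PropBank frame files are XML documents, so the dictionary layout and the `describe` helper are simplifying assumptions made for illustration.

```python
# Rolesets for "ring", copied from the slide; the dict layout is an assumption.
FRAME_RING = {
    "ring.01": {
        "gloss": "make sound of bell",
        "roles": {"Arg0": "causer of ringing", "Arg1": "thing rung", "Arg2": "ring for"},
    },
    "ring.02": {
        "gloss": "to surround",
        "roles": {"Arg1": "surrounding entity", "Arg2": "surrounded entity"},
    },
}


def describe(roleset_id, labelled_args):
    """Expand (label, text) pairs into readable lines, rejecting labels the roleset lacks."""
    roles = FRAME_RING[roleset_id]["roles"]
    lines = []
    for label, text in labelled_args:
        if label not in roles:
            raise ValueError(f"{label} is not defined for {roleset_id}")
        lines.append(f"{text}: {label} ({roles[label]})")
    return lines


print(describe("ring.01", [("Arg0", "John"), ("Arg1", "the bell")]))
print(describe("ring.02", [("Arg1", "Tall aspen trees"), ("Arg2", "the lake")]))
```

Because the sense ID is chosen first, the same surface label (Arg1) can mean "thing rung" in one sentence and "surrounding entity" in the other without ambiguity.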

slide-19
SLIDE 19

Frame files

  • The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea, Kingsbury, 2005)
  • Most frequently occurring verb: say
  • Small number of verbs had several framesets, e.g. go, come, take, make
  • Most others had only one frameset per file
slide-20
SLIDE 20

PropBank annotation pane in Jubilee

slide-21
SLIDE 21

English PropBank Tagset

  • Numbered arguments: Arg0, Arg1, and so on up to Arg4
  • Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose)
  • Modifiers give additional information about when, where or how the event occurred

slide-22
SLIDE 22

PropBank tagset

Numbered Argument   Description
Arg0                Agent, causer, experiencer
Arg1                Theme, patient
Arg2                Instrument, benefactive, attribute
Arg3                Starting point, benefactive, attribute
Arg4                Ending point

  • Correspond to the valency requirements of the verb
  • Or, those that occur with high frequency with that verb
slide-23
SLIDE 23

PropBank tagset

  • 15 modifier labels for English PropBank
  • [Arg0 He] studied [Arg1 economic growth] [ArgM-LOC in India]

Modifier    Description
ArgM-LOC    Location
ArgM-TMP    Time
ArgM-GOL    Goal
ArgM-MNR    Manner
ArgM-CAU    Cause
ArgM-ADV    Adverbial

slide-24
SLIDE 24

PropBank tagset

  • Verb specific and more generalized
  • Arg0 and Arg1 correspond to Dowty’s Proto-Roles
  • Leverage the commonalities among semantic roles
  • Agents, causers, experiencers: Arg0
  • Undergoers, patients, themes: Arg1
slide-25
SLIDE 25

PropBank tagset

  • While annotating Arg0 and Arg1:

– Unaccusative verbs take Arg1 as their subject argument

  • [Arg1 The window] broke

– Unergatives will take Arg0

  • [Arg0 John] sang
  • Distinction is also made between internally caused events (blush: Arg0) & externally caused events (redden: Arg1)

slide-26
SLIDE 26

PropBank tagset

  • How might these map to the more familiar thematic roles?
  • Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles

slide-27
SLIDE 27
  • The more frequent Arg0 and Arg1 (85%) are learnt more easily by automatic systems
  • Arg2 is less frequent and maps to more than one thematic role
  • Arg3-5 are even less frequent
slide-28
SLIDE 28

Using PropBank

  • As a computational resource

– Train semantic role labellers (Pradhan et al., 2005)
– Question answering systems (with FrameNet)
– Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)

  • For linguists, to study various phenomena related to predicate-argument structure

slide-29
SLIDE 29

Outline

  • English PropBank
  • Background
  • Annotation
  • Frame files & Tagset
  • Hindi PropBank development
  • Adapting Frame files
  • Light verbs
  • Mapping from dependency labels
slide-30
SLIDE 30

Developing PropBank for Hindi-Urdu

  • Hindi-Urdu PropBank is part of a project to develop a multi-layered and multi-representational treebank for Hindi-Urdu

– Hindi Dependency Treebank
– Hindi PropBank
– Hindi Phrase Structure Treebank

  • Ongoing project at CU-Boulder
slide-31
SLIDE 31

Hindi-Urdu PropBank

  • Corpus of 400,000 words for Hindi
  • Smaller corpus of 150,000 words for Urdu
  • Hindi corpus consists of newswire text from ‘Amar Ujala’
  • So far:

– 220 verb frames
– ~100K words annotated

slide-32
SLIDE 32

Developing Hindi PropBank

  • Making a PropBank resource for a new language

– Linguistic differences

  • Capturing relevant language-specific phenomena

– Annotation practices

  • Maintain similar annotation practices

– Consistency across PropBanks

slide-33
SLIDE 33

Developing Hindi PropBank

  • PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees
  • Hindi PropBank is annotated on dependency trees

slide-34
SLIDE 34

Dependency tree

  • Represent relations that hold between constituents (chunks)
  • Karaka labels show the relations between head verb and its dependents

दिये (gave)
    k1: राम ने (Raam erg)
    k2: पैसे (money)
    k4: औरत को (woman dat)

slide-35
SLIDE 35

Hindi PropBank

  • There are three components to the annotation

  • Hindi Frame file creation
  • Insertion of empty categories
  • Semantic role labelling
slide-36
SLIDE 36

Hindi PropBank

  • There are three components to the annotation
  • Hindi Frame file creation
  • Insertion of empty categories
  • Semantic role labelling
  • Both frame creation and labelling require new strategies for Hindi

slide-37
SLIDE 37

Hindi PropBank

  • Hindi frame files were adapted to include

– Morphological causatives
– Unaccusative verbs
– Experiencers

  • Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs

slide-38
SLIDE 38

Causatives

  • Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file.
  • E.g. Cornerstone’s multi-lemma mode is used for the verbs खा (KA ‘eat’), खिला (KilA ‘feed’) and खिलवा (KilvA ‘cause to feed’)

slide-39
SLIDE 39

Causatives

[Arg0 raam ne] [Arg1 khaana] khaayaa
Ram erg food eat-perf
‘Ram ate the food’

[Arg0 mohan ne] [Arg2 raam ko] [Arg1 khaana] khilaayaa
Mohan erg Ram dat food eat-caus-perf
‘Mohan made Ram eat the food’

[ArgC sita ne] [ArgA mohan se] [Arg2 raam ko] [Arg1 khaana] khilvaayaa
Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf
‘Sita, through Mohan, made Ram eat the food’

Roleset id: KA.01 to eat
    Arg0  eater
    Arg1  the thing being eaten

Roleset id: KilA.01 to feed
    Arg0  feeder
    Arg2  eater
    Arg1  the thing being eaten

Roleset id: KilvA.01 to cause to be fed
    ArgC  causer of feeding
    ArgA  feeder
    Arg2  eater
    Arg1  the thing eaten

slide-40
SLIDE 40

Unaccusatives

  • PropBank needs to distinguish proto-agents and proto-patients
  • Sudha danced

– Unergative, animate agentive arguments: Arg0

  • The door opened

– Unaccusative, non-animate patient arguments: Arg1

  • For English, these distinctions were available in VerbNet
  • For Hindi, various diagnostic tests are applied to make this distinction

slide-41
SLIDE 41

Unaccusativity diagnostics (5)

  • First stage: cognate object & ergative case tests

– Yes: unergative. E.g. naac, dauRa, bEtha?
– No: second stage

  • Second stage: applicable to verbs undergoing transitivity alternation: eliminate those that take (mostly) animate subjects

– Yes: unaccusative. E.g. khulnaa, barasnaa. Eliminate bEtha
– No: third stage

  • Third stage: for non-alternating verbs and others that remain, the verb will be unaccusative if:

– Impersonal passives are not possible
– Use of ‘huaa’ (past participial relative) is possible
– The inanimate subject appears without overt genitive
– Yes: unaccusative
– No: take the majority vote on the tests

slide-42
SLIDE 42

Unaccusativity

Unaccusative verb: महक mahak ‘to smell good’
    Arg1  entity that smells good

  • impersonal passive test
  • past participial relative test
  • This information is then captured in the frame
slide-43
SLIDE 43

Experiencers

  • Arg0 includes agents, causers, experiencers
  • In Hindi, experiencer subjects occur with dikhnaa (to glimpse), milnaa (to find), suujhnaa (to be struck with) etc.
  • Typically marked with dative case

– Mujh-ko chaand dikhaa
  I.dat moon glimpse.pf
  ‘I glimpsed the moon’

  • PropBank labels these as Arg0, with an additional descriptor field ‘experiencer’

– Internally caused events

slide-44
SLIDE 44

Complex predicates

  • A large number of complex predicates in Hindi

– Noun-verb complex predicates

  • vichaar karna; think do (to think)
  • dhyaan denaa; attention give (to pay attention)

– Adjectives can also occur

  • accha lagnaa: good seem (to like)

– the predicating element is not the verb alone, but forms a composite with the noun/adj

slide-45
SLIDE 45

Complex predicates

  • There are also a number of verb-verb complex predicates
  • Combine with the bare form of the verb to convey different meanings
  • ro paRaa; cry lie (to burst out crying)
  • Add some aspectual meaning to the verb
  • As these occur with only a certain set of verbs, we find them automatically, based on part of speech tag

slide-46
SLIDE 46

Complex predicates

  • However, the noun-verb complex predicates are more pervasive
  • They occur with a large number of nouns
  • Frame files for the nominal predicate need to be created
  • E.g. in a sample of 90K words, the light verb kar ‘do’ occurred 1738 times with 298 different nominals

slide-47
SLIDE 47

Complex predicates

  • Annotation strategy
  • Additional resources (frame files)
  • Cluster the large number of nominals?
slide-48
SLIDE 48

Light verb annotation

  • A convention for the annotation of light verbs has been adopted across Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010)
  • Annotation is carried out in three passes

– Manual identification of light verb
– Annotation of arguments based on frame file
– Deterministic merging of the light verb and ‘true’ predicate

  • In Hindi, this process may be simplified because of the dependency label ‘pof’ that identifies a light verb
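
A minimal sketch of how the ‘pof’ label could stand in for manual identification: scan the dependents of each verb and collect those attached with ‘pof’. The token format (dictionaries with `id`, `form`, `dep`, `head`) is an assumption made for illustration, not the treebank's actual file format, and the k1/k2 labels on the other tokens are likewise assumed.

```python
def find_complex_predicates(tokens):
    """Return (nominal, light_verb) pairs for every token attached with the 'pof' label."""
    by_id = {t["id"]: t for t in tokens}
    return [(t["form"], by_id[t["head"]]["form"])
            for t in tokens if t["dep"] == "pof"]


# Toy encoding of "raam ne paese corii kiye" ('Ram stole the money').
sentence = [
    {"id": 1, "form": "raam",  "dep": "k1",   "head": 4},
    {"id": 2, "form": "paese", "dep": "k2",   "head": 4},
    {"id": 3, "form": "corii", "dep": "pof",  "head": 4},
    {"id": 4, "form": "kiye",  "dep": "root", "head": 0},
]

print(find_complex_predicates(sentence))
```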

slide-49
SLIDE 49

राम ने पैसे चोरी कीए

Raam erg money theft do.perf
‘Ram stole the money’

  • 1. Identify the N+V sequences that are complex predicates.
  • 2. Annotate the predicating expression with ARG-PRX.
        REL: kar
        ARG-PRX: corii
  • 3. Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame.
  • 4. Automatically merge the nominal with the light verb.
        REL: corii_kiye
        Arg0: raam
        Arg1: paese
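
The final merging step is deterministic, so it can be sketched as a small function. The dictionary representation of an annotation and the name `merge_light_verb` are assumptions for illustration. The sketch also stores the surface form of the light verb ("kiye") in REL so that the merged predicate matches the "corii_kiye" shown above.

```python
def merge_light_verb(annotation):
    """Fold the ARG-PRX nominal into REL and keep every other argument unchanged."""
    merged = {label: text for label, text in annotation.items()
              if label not in ("REL", "ARG-PRX")}
    merged["REL"] = f'{annotation["ARG-PRX"]}_{annotation["REL"]}'
    return merged


# Annotation after the first two passes (toy representation).
before_merge = {"REL": "kiye", "ARG-PRX": "corii", "Arg0": "raam", "Arg1": "paese"}

print(merge_light_verb(before_merge))
```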

slide-50
SLIDE 50

Hindi PropBank tagset

24 labels

Label   Description
ARG0    Agent, causer, experiencer
ARG1    Patient, theme, undergoer
ARG2    Beneficiary
ARG3    Instrument

slide-51
SLIDE 51

Annotating Hindi PropBank

Label       Description                   Type
ARG0        Agent, causer, experiencer
ARG1        Patient, theme, undergoer
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    Attribute                     Function tag
ARG2-LOC    Location                      Function tag
ARG2-GOL    Goal                          Function tag
ARG2-SOU    Source                        Function tag

slide-52
SLIDE 52

Annotating Hindi PropBank

Label       Description                   Type
ARG0        Agent, causer, experiencer
ARG1        Patient, theme, undergoer
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    Attribute                     Function tag
ARG2-LOC    Location                      Function tag
ARG2-GOL    Goal                          Function tag
ARG2-SOU    Source                        Function tag
ARGC        Causer                        Causative
ARGA        Secondary causer              Causative

slide-53
SLIDE 53

Annotating Hindi PropBank

Label       Description                   Type
ARG0        Agent, causer, experiencer
ARG1        Patient, theme, undergoer
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    Attribute                     Function tag
ARG2-LOC    Location                      Function tag
ARG2-GOL    Goal                          Function tag
ARG2-SOU    Source                        Function tag
ARGC        Causer                        Causative
ARGA        Secondary causer              Causative
ARGM-VLV    Verb-verb construction        Complex predicate
ARGM-PRX    Noun-verb construction        Complex predicate

slide-54
SLIDE 54

Annotating Hindi PropBank

  • Other modifier labels

Label       Description     Label       Description
ARGM-ADV    adverb          ARGM-MNR    manner
ARGM-CAU    cause           ARGM-MNS    means
ARGM-DIR    direction       ARGM-MOD    modal
ARGM-DIS    discourse       ARGM-NEG    negation
ARGM-EXT    extent          ARGM-PRP    purpose
ARGM-LOC    location        ARGM-TMP    time

slide-55
SLIDE 55

Hindi PropBank annotation

[Screenshot: annotation pane and frameset file display]

slide-56
SLIDE 56

Annotation practice

  • Need to maintain good annotation practices
  • Current practice: double-blind annotation, followed by adjudication
  • Inter-annotator agreement measures the consistency in the annotation task
  • English PropBank had high inter-annotator agreement (kappa = 0.91)
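
As an illustration of how such an agreement figure is computed, the sketch below implements Cohen's kappa for two annotators. It assumes the reported K is a kappa statistic of this kind, which the slides do not spell out. The toy label sequences are invented for the example and are not PropBank data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy annotations: the annotators disagree on one argument out of four.
annotator_1 = ["Arg0", "Arg1", "Arg1", "ArgM-LOC"]
annotator_2 = ["Arg0", "Arg1", "Arg2", "ArgM-LOC"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement (here 3 out of 4) for the agreement expected by chance, which is why it is preferred over a simple percentage. The function is undefined when expected agreement is exactly 1, for example when both annotators use a single label throughout.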

slide-57
SLIDE 57

Hindi PropBank annotation

  • Improve consistency & annotation speed
  • PropBank annotation on dependency trees has some advantages
  • Hindi Treebank uses a large set of dependency labels that have rich semantic information

दिये (gave)
    k1 / Arg0: राम ने (Raam erg)
    k2 / Arg1: पैसे (money)
    k4 / Arg2: औरत को (woman dat)

slide-58
SLIDE 58

Deriving PropBank labels from dependencies

  • We can derive Hindi PropBank labels from dependency labels
  • Mapping will reduce annotation effort, improve speed
  • The dependency tagset has labels that are in some ways fairly similar to PropBank

– Verb-specific labels k1-5
– Verb modifier labels k7p, k7t, rh etc.

slide-59
SLIDE 59

Label comparison

  • Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB

slide-60
SLIDE 60

Label comparison

  • Similarly, linguistic intuition gives us the mapping from HDT for HPB modifiers

slide-61
SLIDE 61

Label comparison

  • These mappings are included in the PB frame files, for example, the verb ‘A: to come’
  • Only for numbered arguments
  • Basis for the linguistically motivated rules

Roleset   Usage             Rule
A.01      To come (path)    k1 → Arg1, k2p → Arg2-GOL
A.03      To arrive         k1 → Arg0, k2p → Arg2-GOL, k5 → Arg2-SOU
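
The roleset-specific rules in the table above amount to a two-level lookup: first by roleset ID, then by dependency label. A minimal sketch, assuming the rules are stored as nested dictionaries (the storage format and the name `map_dependents` are illustrative, not the actual frame-file schema):

```python
# Mapping rules for the verb 'A' (to come), copied from the table above.
MAPPING_RULES = {
    "A.01": {"k1": "Arg1", "k2p": "Arg2-GOL"},
    "A.03": {"k1": "Arg0", "k2p": "Arg2-GOL", "k5": "Arg2-SOU"},
}


def map_dependents(roleset_id, dep_labels):
    """Map each dependency label to a PropBank label, or None when no rule applies."""
    rules = MAPPING_RULES.get(roleset_id, {})
    return {dep: rules.get(dep) for dep in dep_labels}


print(map_dependents("A.01", ["k1", "k2p"]))
print(map_dependents("A.03", ["k1", "k5", "k7t"]))
```

The same dependency label k1 maps to Arg1 under A.01 but to Arg0 under A.03, which is why the rules are keyed on the roleset rather than on the dependency label alone. Labels without a rule (here the modifier k7t) are left unmapped, since the frame-file rules cover numbered arguments only.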

slide-62
SLIDE 62

Automatic mapping of DT to PB

  • A rule-based, probabilistic system for automatic mapping
  • We use two kinds of resources:

– Annotated corpus [Treebank + PropBank]

  • 32,300 tokens, 2005 predicates

– Frame files with mapping rules

slide-63
SLIDE 63

Argument classification

  • We use three kinds of rules to carry out automatic mapping

– Deterministic rules

  • Dependency label mapped directly onto PropBank

– Empirically derived rules

  • Using corpus statistics associated with dependency & PropBank labels

– Linguistically motivated rules

  • Derived from linguistic intuition & captured in frame files

slide-64
SLIDE 64

Example of the rules

Features                  Count   PropBank labels
xe.01_active_k1 (give)    32      Arg0: 0.93   Arg1: 0.03   Arg2: 0.03
xe.01_active_k2           65      Arg1: 0.95   Arg2: 0.01   Arg0: 0.01
xe.01_active_k4           34      Arg2: 0.94   Arg0: 0.02

  • Associate the probability of each label in combination with a particular feature tuple
  • We use only 3 features: roleset ID, voice, dependency label
  • For the verb give, we get the correct mapping to the Hindi labels
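
Empirically derived rules of this kind can be sketched as relative-frequency estimates over (roleset ID, voice, dependency label) tuples, with the most probable label chosen at prediction time. The function names and the toy counts below are invented for illustration. The slides do not say how the probabilities were estimated or whether any threshold was applied.

```python
from collections import Counter, defaultdict


def train(observations):
    """Estimate P(label | features) by relative frequency.

    observations: iterable of ((roleset_id, voice, dep_label), propbank_label) pairs.
    """
    counts = defaultdict(Counter)
    for features, label in observations:
        counts[features][label] += 1
    return {features: {label: n / sum(c.values()) for label, n in c.items()}
            for features, c in counts.items()}


def predict(model, features):
    """Return the most probable PropBank label, or None for an unseen feature tuple."""
    dist = model.get(features)
    if not dist:
        return None
    return max(dist, key=dist.get)


# Invented toy data: k1 of active 'xe.01' (give) is Arg0 nine times out of ten.
toy = [(("xe.01", "active", "k1"), "Arg0")] * 9 + [(("xe.01", "active", "k1"), "Arg1")]
model = train(toy)

print(predict(model, ("xe.01", "active", "k1")))
print(predict(model, ("xe.01", "passive", "k1")))
```

Returning None for unseen tuples mirrors the precision/recall trade-off of such rules: the system only labels arguments it has evidence for, so a small training corpus limits recall rather than precision.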

slide-65
SLIDE 65

Evaluation

  • 32,300 tokens of annotated data
  • Ten-fold cross validation for evaluation


slide-66
SLIDE 66

Results

                                 PRECISION   RECALL   F1 SCORE
Empirically derived rules        90.59       47.92    62.69
Linguistically motivated rules   89.80       55.28    68.44

slide-67
SLIDE 67

Results

                                 PRECISION   RECALL   F1 SCORE
Empirically derived rules        90.59       47.92    62.69
Linguistically motivated rules   89.80       55.28    68.44

Numbered Argument Accuracy

                                 PRECISION   RECALL   F-SCORE
Empirically derived rules        93.63       58.76    72.21
Linguistically motivated rules   91.87       72.36    80.96
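
The reported F scores are consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The slides do not state the formula, so this is an assumption. A quick check against the figures in the tables:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F score) as given in the results tables.
reported = {
    "empirical, all arguments": (90.59, 47.92, 62.69),
    "linguistic, all arguments": (89.80, 55.28, 68.44),
    "empirical, numbered arguments": (93.63, 58.76, 72.21),
    "linguistic, numbered arguments": (91.87, 72.36, 80.96),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed {f1(p, r):.2f}, reported {f:.2f}")
```

Recomputing from the rounded precision and recall values reproduces each reported score to within 0.01. The remaining differences are presumably due to rounding or to averaging over the cross-validation folds.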

slide-68
SLIDE 68

Evaluation

  • Linguistically motivated rules improve the recall with a slight drop in the precision
  • With the most frequent PropBank labels, empirically derived rules perform well
  • More data should improve the performance for modifier arguments

slide-69
SLIDE 69

Evaluation

  • Annotation practices also affect the mapping

– Some PB labels are coarse-grained: e.g. ArgM-ADV maps to four different dependency labels
– Other PB labels are fine-grained: e.g. ‘means’ and ‘causes’ are distinguished (ArgM-MNS, ArgM-CAU) but are lumped together under a single label in the dependency treebank

slide-70
SLIDE 70

Evaluation

  • Our goal is to find a useful set of mappings to use at a pre-annotation stage
  • We expect that with more PropBanked data, empirically derived rules will perform better
  • Using additional resources such as Hindi WordNet can help to make up for the lack of training data

slide-71
SLIDE 71

Summary

  • Semantic role labels
  • English PropBank

– Frame files
– PropBank tagset

  • Development of Hindi PropBank

– Linguistic issues
– Experiments

slide-72
SLIDE 72

Questions?