SLIDE 1 Inspecting the Structural Biases
of Dependency Parsing Algorithms
Yoav Goldberg and Michael Elhadad
Ben Gurion University
CoNLL 2010, Sweden
SLIDE 2
There are many ways to parse a sentence
SLIDE 3 There are many ways to parse a sentence
Transition Based Parsers
- Covington
- Multiple Passes
- Arc-Eager
- Arc-Standard
- With Swap Operator
- First-Best Parser
- DAG Parsing
- With a Beam
- With Dynamic Programming
- With Tree Revision
- Left-to-right
- Right-to-left
Easy-First Parsing (check out …)
Graph Based Parsers
- First Order
- Second Order (two children / with grandparent)
- Third Order
- MST Algorithm / Matrix Tree Theorem
- Eisner Algorithm
- Belief Propagation
- With global constraints (ILP / Gibbs sampling)
Combinations (… way, Attardi’s way)
SLIDE 4
We can build many reasonably accurate parsers
SLIDE 5
We can build many reasonably accurate parsers
Parser combinations work
SLIDE 6
We can build many reasonably accurate parsers
Parser combinations work
⇒ every parser has its strong points
SLIDE 7
We can build many reasonably accurate parsers
Parser combinations work
⇒ every parser has its strong points
Different parsers behave differently
SLIDE 8
Open questions
SLIDE 9
Open questions
WHY do they behave as they do?
SLIDE 10
Open questions
WHY do they behave as they do?
WHAT are the differences between them?
SLIDE 11
More open questions
Which linguistic phenomena are hard for parser X?
SLIDE 12
More open questions
Which linguistic phenomena are hard for parser X?
What kinds of errors are common for parser Y?
SLIDE 13
More open questions
Which linguistic phenomena are hard for parser X?
What kinds of errors are common for parser Y?
Which parsing approach is most suitable for language Z?
SLIDE 14
SLIDE 15
SLIDE 16
SLIDE 17
SLIDE 18
SLIDE 19
SLIDE 20
Previously
McDonald and Nivre 2007: “Characterizing the Errors of Data-Driven Dependency Parsing Models”
SLIDE 21 Previously
McDonald and Nivre 2007: “Characterizing the Errors of Data-Driven Dependency Parsing Models”
◮ Focus on single-edge errors
SLIDE 22 Previously
McDonald and Nivre 2007: “Characterizing the Errors of Data-Driven Dependency Parsing Models”
◮ Focus on single-edge errors
◮ MST better for long edges, MALT better for short
◮ MST better near root, MALT better away from root
◮ MALT better at nouns and pronouns, MST better at others
SLIDE 23 Previously
McDonald and Nivre 2007: “Characterizing the Errors of Data-Driven Dependency Parsing Models”
◮ Focus on single-edge errors
◮ MST better for long edges, MALT better for short
◮ MST better near root, MALT better away from root
◮ MALT better at nouns and pronouns, MST better at others
◮ . . . but all these differences are very small
SLIDE 24
we do something a bit different
SLIDE 25 Assumptions
◮ Parsers fail in predictable ways
◮ those can be analyzed
◮ analysis should be done by inspecting trends rather than individual decisions
SLIDE 26
Note: We do not do error analysis
SLIDE 27 Note: We do not do error analysis
◮ Error analysis is complicated
◮ one error can yield another / hide another
SLIDE 28 Note: We do not do error analysis
◮ Error analysis is complicated
◮ one error can yield another / hide another
◮ Error analysis is local to one tree
◮ many factors may be involved in that single error
SLIDE 29 Note: We do not do error analysis
◮ Error analysis is complicated
◮ one error can yield another / hide another
◮ Error analysis is local to one tree
◮ many factors may be involved in that single error
we are aiming at more global trends
SLIDE 30
Structural Preferences
SLIDE 31 Structural preferences
for a given language+syntactic theory
◮ Some structures are more common than others
◮ (think Right Branching for English)
SLIDE 32 Structural preferences
for a given language+syntactic theory
◮ Some structures are more common than others
◮ (think Right Branching for English)
◮ Some structures are very rare
◮ (think non-projectivity, OSV constituent order)
SLIDE 33
Structural preferences
parsers also exhibit structural preferences
SLIDE 34 Structural preferences
parsers also exhibit structural preferences
◮ Some are explicit / by design
◮ e.g. projectivity
SLIDE 35 Structural preferences
parsers also exhibit structural preferences
◮ Some are explicit / by design
◮ e.g. projectivity
◮ Some are implicit, stem from
◮ features
◮ modeling
◮ data
◮ interactions
◮ and other stuff
SLIDE 36 Structural preferences
parsers also exhibit structural preferences
◮ Some are explicit / by design
◮ e.g. projectivity
◮ Some are implicit, stem from
◮ features
◮ modeling
◮ data
◮ interactions
◮ and other stuff
These trends are interesting!
SLIDE 37
Structural Bias
SLIDE 38
Structural bias
“The difference between the structural preferences of two languages”
SLIDE 39
Structural bias
“The difference between the structural preferences of two languages”
For us: which structures tend to occur more in the language than in the parser?
SLIDE 40 Bias vs. Error
related, but not the same
Parser X makes many PP attachment errors
◮ claim about error pattern
SLIDE 41 Bias vs. Error
related, but not the same
Parser X makes many PP attachment errors
◮ claim about error pattern
Parser X tends to attach PPs low, while language Y tends to attach them high
◮ claim about structural bias (and also about errors)
SLIDE 42 Bias vs. Error
related, but not the same
Parser X makes many PP attachment errors
◮ claim about error pattern
Parser X tends to attach PPs low, while language Y tends to attach them high
◮ claim about structural bias (and also about errors)
Parser X can never produce structure Y
◮ claim about structural bias
SLIDE 43
Formulating Structural Bias
“given a tree, can we say where it came from?”
SLIDE 44
Formulating Structural Bias
“given two trees of the same sentence, can we tell which parser produced each parse?”
SLIDE 45
Formulating Structural Bias
“which parser produced which tree?”
Any predictor that can help us answer this question is an indicator of structural bias.
SLIDE 46
Formulating Structural Bias
“which parser produced which tree?”
Any predictor that can help us answer this question is an indicator of structural bias.
SLIDE 47
Formulating Structural Bias
“which parser produced which tree?”
Any predictor that can help us answer this question is an indicator of structural bias.
Uncovering structural bias = searching for good predictors.
SLIDE 48 Method
◮ start with two sets of parses for the same set of sentences
◮ look for predictors that allow us to distinguish between the trees in each group
SLIDE 49 Our Predictors
◮ all possible subtrees
SLIDE 50 Our Predictors
◮ all possible subtrees
◮ always encode:
◮ parts of speech
◮ relations
◮ direction
(example subtree over JJ NN VB IN)
SLIDE 51 Our Predictors
◮ all possible subtrees
◮ always encode:
◮ parts of speech
◮ relations
◮ direction
◮ can encode also:
◮ lexical items
(example subtree over JJ NN VB IN/with)
SLIDE 52 Our Predictors
◮ all possible subtrees
◮ always encode:
◮ parts of speech
◮ relations
◮ direction
◮ can encode also:
◮ lexical items
◮ distance to parent
(example subtree over JJ NN VB IN/with, with distances 4 and 2)
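As a concrete illustration, a single-edge predictor of this kind can be rendered as a flat pattern string. The encoding below (and the name `encode_edge`) is our own illustrative sketch, not the representation used in the paper:

```python
def encode_edge(head_pos, dep_pos, direction, word=None, dist=None):
    """Encode one attachment as a subtree-predictor pattern.
    Mandatory: parts of speech of head and dependent, plus direction;
    optional: the dependent's lexical item and its distance to the parent."""
    dep = dep_pos
    if word is not None:
        dep += "/" + word           # lexical item, e.g. IN/with
    if dist is not None:
        dep += ":" + str(dist)      # distance to parent, e.g. 2
    arrow = "->" if direction == "R" else "<-"
    return head_pos + arrow + dep

encode_edge("VB", "IN", "R", word="with", dist=2)   # -> 'VB->IN/with:2'
encode_edge("NN", "JJ", "L")                        # -> 'NN<-JJ'
```

Because optional information only extends the pattern, unlexicalized and lexicalized variants of the same edge can coexist as separate predictors.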
SLIDE 53
Search Procedure
boosting with subtree features
algorithm by Kudo and Matsumoto 2004.
SLIDE 54
Search Procedure
boosting with subtree features
algorithm by Kudo and Matsumoto (2004). Very briefly:
SLIDE 55 Search Procedure
boosting with subtree features
algorithm by Kudo and Matsumoto (2004). Very briefly:
◮ input: two sets of constituency trees
◮ while not done:
◮ choose a subtree that classifies most trees correctly
◮ re-weight trees based on errors
SLIDE 56 Search Procedure
boosting with subtree features
algorithm by Kudo and Matsumoto (2004). Very briefly:
◮ input: two sets of constituency trees
◮ while not done:
◮ choose a subtree that classifies most trees correctly
◮ re-weight trees based on errors
◮ output: weighted subtrees (= linear classifier)
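The loop above can be sketched in code. This is a simplified AdaBoost-style sketch, not Kudo and Matsumoto's actual implementation: each tree is reduced to the set of subtree patterns it contains (so subtree matching is just set membership), and all names below are illustrative.

```python
import math

def boost_subtree_features(trees, labels, rounds=10):
    """Boosting with subtree features, simplified: `trees` are sets of
    subtree patterns, `labels` are +1 (one group, e.g. gold trees) or
    -1 (the other group, e.g. parser output).  Each round picks the
    subtree whose presence best separates the two groups, then
    re-weights the trees it got wrong.  Output: weighted subtrees,
    i.e. a linear classifier."""
    n = len(trees)
    w = [1.0 / n] * n                       # one weight per tree
    features = sorted(set().union(*trees))  # candidate subtree patterns
    chosen = []
    for _ in range(rounds):
        # weak hypothesis h_f(tree) = +1 if subtree f occurs, else -1
        best_f, best_err = None, None
        for f in features:
            err = sum(wi for wi, t, y in zip(w, trees, labels)
                      if (1 if f in t else -1) != y)
            if best_err is None or err < best_err:
                best_f, best_err = f, err
        eps = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - eps) / eps)
        chosen.append((best_f, alpha))
        # re-weight: misclassified trees become more important
        w = [wi * math.exp(-alpha * y * (1 if best_f in t else -1))
             for wi, t, y in zip(w, trees, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
    return chosen

def classify(chosen, tree):
    """Sign of the weighted vote over the selected subtrees."""
    score = sum(alpha * (1 if f in tree else -1) for f, alpha in chosen)
    return 1 if score >= 0 else -1
```

The highly weighted subtrees in the output are exactly the predictors of structural bias the method is after; the classifier itself is a by-product.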
SLIDE 57
SLIDE 58
SLIDE 59
SLIDE 60
SLIDE 61
SLIDE 62 conversion to constituency
(example: the dependency subtree JJ:3 NN VB IN/with:2 becomes constituents labeled JJ→ d:3, VB←, NN→, IN← w:with d:2)
- mandatory information at node label
- optional information as leaves
SLIDE 63 conversion to constituency (same as slide 62)
SLIDE 64 conversion to constituency (same as slide 62)
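The conversion can be sketched in a few lines. This is a minimal illustration of the idea only (mandatory POS + arc direction in the node label, optional word and distance as extra leaves); the function name and bracket format are our own, not the paper's:

```python
def to_constituency(pos, direction, children=(), word=None, dist=None):
    """Render one dependency node as a bracketed constituent:
    the node label carries the mandatory information (POS + arc direction);
    optional information (lexical item, distance to parent) becomes leaves."""
    label = pos + direction                  # e.g. 'IN<-'
    leaves = []
    if word is not None:
        leaves.append("(w:" + word + ")")    # optional lexical item
    if dist is not None:
        leaves.append("(d:" + str(dist) + ")")  # optional distance to parent
    parts = list(children) + leaves
    if not parts:
        return "(" + label + ")"
    return "(" + label + " " + " ".join(parts) + ")"

# an IN<- node with lexical item 'with' at distance 2 from its parent:
to_constituency("IN", "<-", word="with", dist=2)   # -> '(IN<- (w:with) (d:2))'
```

Putting the mandatory information in the label and the optional information in leaves means a subtree-mining algorithm over these constituency trees can match a pattern with or without the optional parts.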
SLIDE 65 Experiments
Analyzed Parsers
◮ Malt Eager
◮ Malt Standard
◮ Mst 1
◮ Mst 2
SLIDE 66 Experiments
Analyzed Parsers
◮ Malt Eager
◮ Malt Standard
◮ Mst 1
◮ Mst 2
Data
◮ WSJ (converted using Johansson and Nugues)
◮ splits: parse-train (15-18), boost-train (10-11), boost-val (4-7)
◮ gold pos-tags
SLIDE 67
Quantitative Results
Q: Are the parsers biased with respect to English?
SLIDE 68
Quantitative Results
Q: Are the parsers biased with respect to English?
A: Yes
SLIDE 69
Quantitative Results
Q: Are the parsers biased with respect to English?
A: Yes

Parser  Train Accuracy  Val Accuracy
MST1    65.4            57.8
MST2    62.8            56.6
MALTE   69.2            65.3
MALTS   65.1            60.1
Table: Distinguishing parser output from gold-trees based on structural information
SLIDE 70
Qualitative Results (teasers)
Over-produced by ArcEager:
SLIDE 71
Qualitative Results (teasers)
Over-produced by ArcEager:
ROOT→“  ROOT→DT  ROOT→WP
(we knew it’s bad at root, now we know how!)
SLIDE 72
Qualitative Results (teasers)
Over-produced by ArcEager:
ROOT→“  ROOT→DT  ROOT→WP
ROOT with two VBD dependents
(we knew it’s bad at root, now we know how!)
SLIDE 73
Qualitative Results (teasers)
Over-produced by ArcEager and ArcStandard
SLIDE 74 Qualitative Results (teasers)
Over-produced by ArcEager and ArcStandard
VBD→VBD with distance 9+
VBD→VBD with distance 5−7
ROOT→VBZ→VBZ
(prefer the first verb above the second one: because of left-to-right processing?)
SLIDE 75
Qualitative Results (teasers)
Over-produced by MST1
SLIDE 76
Qualitative Results (teasers)
Over-produced by MST1
(subtree over IN NN NN NN NN VBZ)
(independence assumption failing)
SLIDE 77
Qualitative Results (teasers)
Under-produced by MST1 and MST2
SLIDE 78
Qualitative Results (teasers)
Under-produced by MST1 and MST2
(subtree over NN IN CC NN)
(hard time coordinating “heavy” NPs: due to the pos-in-between feature?)
SLIDE 79
Qualitative Results (teasers)
More in paper
You should read it
SLIDE 80
Qualitative Results (teasers)
Software available
Try with your language / parser
SLIDE 81 To Conclude
◮ understanding HOW parsers behave and WHY is important
◮ we should do more of that
◮ we defined structural bias as a way of characterizing behaviour
◮ we presented an algorithm for uncovering structural bias
◮ applied it to English with interesting results