Parsing: CSCI 3130 Formal Languages and Automata Theory (Siu On CHAN)



slide-1
SLIDE 1

1/18

Parsing

CSCI 3130 Formal Languages and Automata Theory Siu On CHAN

Chinese University of Hong Kong

Fall 2015

slide-2
SLIDE 2

2/18

Parsing

S → 0S1 | 1S0S | T
T → S | ε

input: 0011

Is 0011 ∈ L? If so, how to build a parse tree with a program?

slide-3
SLIDE 3

3/18

Parsing

S → 0S1 | 1S0S | T
T → S | ε

input: 0011

Try all derivations?

S
  T
    ε
    S …
  1S0S
    10S10S …
  0S1
    0T1 …
    01S0S1 …
    00S11
      00T11
        0011 ✓
        00S11 …

This is (part of) the tree of all derivations, not the parse tree
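The "try all derivations" idea can be sketched as a breadth-first search over leftmost derivations. This is a minimal illustration rather than code from the course: the name `derives`, the `max_forms` budget, and the convention that dictionary keys are variables and "" stands for ε are assumptions made here. The budget is there because this grammar produces infinitely many sentential forms, so the plain search cannot stop on its own for an input outside the language.

```python
from collections import deque

# Grammar from the slide; "" stands for ε. Dictionary keys are the variables.
GRAMMAR = {"S": ["0S1", "1S0S", "T"], "T": ["S", ""]}

def derives(target, start="S", max_forms=10_000):
    """Breadth-first search over leftmost derivations.

    Returns True if target is derived, False if every sentential form was
    explored without success, and None if the budget ran out first.
    """
    seen = {start}
    queue = deque([start])
    explored = 0
    while queue and explored < max_forms:
        form = queue.popleft()
        explored += 1
        if form == target:
            return True
        # Position of the leftmost variable, if there is one.
        pos = next((i for i, c in enumerate(form) if c in GRAMMAR), None)
        if pos is None:
            continue  # a terminal string that is not the target
        for rhs in GRAMMAR[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return None if queue else False
```

With this grammar, `derives("0011")` succeeds after a handful of steps, while `derives("1", max_forms=2000)` gives up with `None`: the search alone cannot tell "not in the language" from "not found yet".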

slide-7
SLIDE 7

4/18

Problems

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop

Let’s tackle the 2nd problem

slide-10
SLIDE 10

5/18

When to stop

S → 0S1 | 1S0S | T
T → S | ε

Idea: Stop when

|derived string| > |input|

Problems:

S ⇒ 0S1 ⇒ 0T1 ⇒ 01

Derived string may shrink because of “ε-productions”

S ⇒ T ⇒ S ⇒ T ⇒ …

Derivation may loop because of “unit productions”

Remove ε and unit productions

slide-16
SLIDE 16

6/18

Removing ε-productions

Goal: remove all A → ε rules for every non-start variable A

If S is the start variable and the rule S → ε exists
  Add a new start variable T
  Add the rule T → S

For every rule A → ε where A is not the (new) start variable

  • 1. Remove the rule A → ε
  • 2. If you see B → αAβ, add a new rule B → αβ

Example:

S → ACD
A → a
B → ε (removed)
C → ED | ε (ε removed)
D → BC | b
E → b

Rules added at each step:

Removing B → ε: add D → C
Removing C → ε: add S → AD and D → ε
Removing D → ε: add C → E and S → A

slide-18
SLIDE 18

7/18

Eliminating ε-productions

For every A → ε rule where A is not the start variable

  • 1. Remove the rule A → ε
  • 2. If you see B → αAβ, add a new rule B → αβ

Do 2. every time A appears:

B → αAβAγ yields B → αβAγ, B → αAβγ, and B → αβγ

B → A becomes B → ε

If B → ε was removed earlier, don’t add it back
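The two steps above can be sketched in Python. This is one possible reading of the procedure, not the course's code: the name `remove_epsilon`, the single-character symbols, and the use of "" for ε are assumptions made here, and the preliminary step of adding a new start variable is left out (an ε-rule of the start variable is simply kept).

```python
from itertools import combinations

def remove_epsilon(grammar, start):
    """Remove A → ε for every non-start variable A.

    grammar maps a variable (one character) to a set of right-hand sides;
    "" stands for ε. A new dictionary is returned.
    """
    rules = {v: set(rhss) for v, rhss in grammar.items()}
    removed = set()  # variables whose ε-rule has already been removed
    while True:
        # Pick any remaining ε-rule of a non-start variable.
        a = next((v for v in rules if v != start and "" in rules[v]), None)
        if a is None:
            return rules
        rules[a].discard("")  # step 1
        removed.add(a)
        for b in rules:       # step 2, for every occurrence of a
            for rhs in list(rules[b]):
                spots = [i for i, c in enumerate(rhs) if c == a]
                # Drop every non-empty subset of the occurrences of a.
                for k in range(1, len(spots) + 1):
                    for drop in combinations(spots, k):
                        new = "".join(c for i, c in enumerate(rhs) if i not in drop)
                        if new == "" and b in removed:
                            continue  # B → ε was removed earlier: don't add it back
                        rules[b].add(new)

# The example grammar from the "Removing ε-productions" slide.
EXAMPLE = {
    "S": {"ACD"}, "A": {"a"}, "B": {""},
    "C": {"ED", ""}, "D": {"BC", "b"}, "E": {"b"},
}
result = remove_epsilon(EXAMPLE, "S")
```

Because step 2 is applied mechanically to every rule, `result` contains S → AC and D → B in addition to the rules listed in the example (D → C, S → AD, C → E, S → A). D → B is harmless, since B has no rules left once B → ε is removed.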

slide-19
SLIDE 19

8/18

Eliminating unit productions

A unit production is a production of the form

A → B

Grammar:

S → 0S1 | 1S0S | T
T → S | R | ε
R → 0SR

Unit production graph (one edge per unit production):

S → T, T → S, T → R

slide-21
SLIDE 21

9/18

Removing unit productions

1 If there is a cycle of unit productions

A → B → · · · → C → A

delete it and replace everything with A

S → 0S1 | 1S0S | T
T → S | R | ε
R → 0SR

The unit productions S → T and T → S form a cycle. Replace T by S:

S → 0S1 | 1S0S
S → R | ε
R → 0SR

slide-23
SLIDE 23

10/18

Removal of unit productions

2 replace any chain

A → B → · · · → C → α

by

A → α, B → α, . . . , C → α

S → 0S1 | 1S0S | R | ε
R → 0SR

Replace S → R → 0SR by S → 0SR, R → 0SR:

S → 0S1 | 1S0S | 0SR | ε
R → 0SR
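A compact way to get the same effect in code is to compute, for each variable, every variable reachable through unit productions and copy over their non-unit rules. This closure-based variant is a substitute for the two-step procedure on the slides (merge cycles, then replace chains), not a transcription of it. The name `remove_unit_productions` and the single-character symbols are assumptions made here.

```python
def remove_unit_productions(grammar):
    """Replace unit productions A → B by the non-unit rules reachable from A.

    grammar maps a variable (one character) to a set of right-hand sides;
    "" stands for ε.
    """
    def is_unit(rhs):
        return len(rhs) == 1 and rhs in grammar

    result = {}
    for a in grammar:
        # Variables reachable from a through chains of unit productions.
        reachable, stack = {a}, [a]
        while stack:
            v = stack.pop()
            for rhs in grammar[v]:
                if is_unit(rhs) and rhs not in reachable:
                    reachable.add(rhs)
                    stack.append(rhs)
        # Every chain a → ... → c → α collapses to a → α.
        result[a] = {rhs for v in reachable for rhs in grammar[v] if not is_unit(rhs)}
    return result

# Grammar from the "Eliminating unit productions" slide.
G = {"S": {"0S1", "1S0S", "T"}, "T": {"S", "R", ""}, "R": {"0SR"}}
out = remove_unit_productions(G)
```

Here `out["S"]` is 0S1 | 1S0S | 0SR | ε and `out["R"]` is 0SR, matching the slides. Unlike the slides, this version keeps T, with the same rules as S, instead of merging it into S; T is no longer mentioned on any right-hand side.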

slide-24
SLIDE 24

11/18

Recap

Problems:

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop

Solution to problem 2:

  • 1. Eliminate ε productions
  • 2. Eliminate unit productions

Try all possible derivations but stop parsing when

|derived string| > |input|

slide-27
SLIDE 27

12/18

Example

S → 0S1 | 0S0S | T
T → S | 0

⇒

S → 0S1 | 0S0S | 0

input: 0011

S
  0S0S
    00S0S0S  too long
    00S10S   too long
    000S
      0000S0S  too long
      0000S1   too long
      0000  ✗
  0S1
    00S0S1  too long
    00S11   too long
    001  ✗
  0  ✗

Conclusion: 0011 ∉ L
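The search in this example can be sketched as follows. It is a minimal illustration under the assumptions that the grammar has no ε or unit productions and that every symbol is one character; the name `member` is made up here. Since no sentential form longer than the input is kept, only finitely many forms exist and the search always stops.

```python
from collections import deque

GRAMMAR = {"S": ["0S1", "0S0S", "0"]}  # no ε or unit productions

def member(target, grammar=GRAMMAR, start="S"):
    """Try all leftmost derivations, pruning forms longer than the input."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        # Position of the leftmost variable, if there is one.
        pos = next((i for i, c in enumerate(form) if c in grammar), None)
        if pos is None:
            continue  # a terminal string different from the target
        for rhs in grammar[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Stop expanding when |derived string| > |input|.
            if len(new) <= len(target) and new not in seen:
                seen.add(new)
                queue.append(new)
    return False
```

`member("0011")` returns `False`, in line with the conclusion above, whereas `member("001")` returns `True` via S ⇒ 0S1 ⇒ 001.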

slide-28
SLIDE 28

13/18

Problems

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop
slide-29
SLIDE 29

14/18

Preparations

A faster way to parse: Cocke–Younger–Kasami algorithm

To use it we must preprocess the CFG:

  • 1. Eliminate ε productions
  • 2. Eliminate unit productions
  • 3. Convert CFG to Chomsky Normal Form

slide-30
SLIDE 30

15/18

Chomsky Normal Form

A CFG is in Chomsky Normal Form if every production has the form

A → BC or A → a

where neither B nor C is the start variable, but we also allow

S → ε

for start variable S

[Photo: Noam Chomsky]

Convert to Chomsky Normal Form:

A → BcDE

⇒ replace terminals with new variables

A → BCDE
C → c

⇒ break up sequences with new variables

A → BX
X → CY
Y → DE
C → c
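The two rewriting steps can be sketched as below. This is an illustration under stated assumptions, not the course's code: ε and unit productions are assumed to be gone already, the fresh variable names `T_c` and `X1`, `X2`, … are invented here and must not clash with existing variables, and the requirement that the start variable never appears on a right-hand side is not handled.

```python
def to_cnf(rules, variables):
    """Rewrite rules into the forms A → BC and A → a.

    rules is a list of (head, body) pairs, where body is a tuple of symbols;
    variables is the set of variable names.
    """
    out = []
    term_var = {}  # terminal -> the new variable that produces it
    counter = 0
    for head, body in rules:
        if len(body) == 1:
            out.append((head, body))  # already of the form A → a
            continue
        # Step 1: replace terminals with new variables.
        symbols = []
        for s in body:
            if s not in variables:
                term_var.setdefault(s, f"T_{s}")
                s = term_var[s]
            symbols.append(s)
        # Step 2: break up long sequences with new variables.
        while len(symbols) > 2:
            counter += 1
            fresh = f"X{counter}"
            out.append((head, (symbols[0], fresh)))
            head, symbols = fresh, symbols[1:]
        out.append((head, tuple(symbols)))
    # Rules for the new terminal variables, such as T_c → c.
    out.extend((v, (t,)) for t, v in term_var.items())
    return out

cnf = to_cnf([("A", ("B", "c", "D", "E"))], {"A", "B", "D", "E"})
```

For the slide's rule A → BcDE this yields A → B X1, X1 → T_c X2, X2 → D E and T_c → c, which is the slide's result with X, Y and C renamed.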

slide-35
SLIDE 35

16/18

Cocke–Younger–Kasami algorithm

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

Input: x = baaba

Let x[i, ℓ] = x_i x_{i+1} … x_{i+ℓ−1}

For every substring x[i, ℓ], remember all variables R that derive x[i, ℓ]
Store in a table T[i, ℓ]

i        1      2      3      4      5
x_i      b      a      a      b      a
ℓ = 1    B      A|C    A|C    B      A|C
ℓ = 2    S|A    B      S|C    S|A

(rows ℓ = 3, 4, 5 not yet filled)

slide-38
SLIDE 38

17/18

Computing T[i, ℓ] for ℓ ≥ 2

To compute T[2, 4]
Try all possible ways to split x[2, 4] into two substrings
Look up entries regarding shorter substrings previously computed

x = baaba, x[2, 4] = aaba

split    left part        right part
1        T[2, 1] = A|C    T[3, 3] = B
2        T[2, 2] = B      T[4, 2] = S|A
3        T[2, 3] = B      T[5, 1] = A|C

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

T[2, 4] = S|A|C
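Filling the whole table with this split rule can be sketched as below. The function name `cyk_table` and the split of the grammar into `BINARY` and `TERMINAL` rule lists are assumptions made for this illustration; indices are 1-based to match T[i, ℓ].

```python
# The grammar from the slides, split into A → BC rules and A → a rules.
BINARY = [("S", ("A", "B")), ("S", ("B", "C")), ("A", ("B", "A")),
          ("B", ("C", "C")), ("C", ("A", "B"))]
TERMINAL = [("A", "a"), ("B", "b"), ("C", "a")]

def cyk_table(x, binary_rules=BINARY, terminal_rules=TERMINAL):
    """Return T with T[i, l] = set of variables that derive x[i, l]."""
    n = len(x)
    T = {}
    # Substrings of length 1: look at the rules A → a.
    for i in range(1, n + 1):
        T[i, 1] = {a for a, c in terminal_rules if c == x[i - 1]}
    # Longer substrings, shortest first, so needed entries already exist.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):  # left part has length k, right part l - k
                left, right = T[i, k], T[i + k, l - k]
                cell |= {a for a, (b, c) in binary_rules
                         if b in left and c in right}
            T[i, l] = cell
    return T

T = cyk_table("baaba")
```

For each of the roughly n²/2 cells the code tries up to n − 1 splits and scans every rule, so the work grows cubically in the input length for a fixed grammar. The input is in the language exactly when the start variable S is in `T[1, len(x)]`.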

slide-39
SLIDE 39

18/18

Cocke–Younger–Kasami algorithm

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

Input: x = baaba

i        1      2      3      4      5
x_i      b      a      a      b      a
ℓ = 1    B      A|C    A|C    B      A|C
ℓ = 2    S|A    B      S|C    S|A
ℓ = 3    ∅      B      B
ℓ = 4    ∅      S|A|C
ℓ = 5    S|A|C

S ∈ T[1, 5], so baaba ∈ L

Get parse tree by tracing back derivations
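Tracing back can be sketched by storing, next to each variable in a cell, one way it was obtained. This is an illustrative sketch: the names `cyk_parse` and `leaves`, the nested-tuple tree format, and the choice to keep only the first derivation found are assumptions made here.

```python
BINARY = [("S", ("A", "B")), ("S", ("B", "C")), ("A", ("B", "A")),
          ("B", ("C", "C")), ("C", ("A", "B"))]
TERMINAL = [("A", "a"), ("B", "b"), ("C", "a")]

def cyk_parse(x, start="S"):
    """Return one parse tree for x as nested tuples, or None if there is none."""
    n = len(x)
    # back[i, l][A] records how A derives x[i, l]:
    # a terminal when l == 1, otherwise (k, B, C) for the split A → BC.
    back = {}
    for i in range(1, n + 1):
        back[i, 1] = {a: c for a, c in TERMINAL if c == x[i - 1]}
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = {}
            for k in range(1, l):
                for a, (b, c) in BINARY:
                    if b in back[i, k] and c in back[i + k, l - k]:
                        cell.setdefault(a, (k, b, c))  # keep the first one found
            back[i, l] = cell

    def build(a, i, l):
        how = back[i, l][a]
        if l == 1:
            return (a, how)  # leaf: variable and its terminal
        k, b, c = how
        return (a, build(b, i, k), build(c, i + k, l - k))

    if n == 0 or start not in back[1, n]:
        return None
    return build(start, 1, n)

def leaves(tree):
    """Concatenate the terminals at the leaves of a tree from cyk_parse."""
    if len(tree) == 2:
        return tree[1]
    return leaves(tree[1]) + leaves(tree[2])

tree = cyk_parse("baaba")
```

The root of `tree` is S and `leaves(tree)` reads back the input. When several derivations exist, as they do for baaba, a different tie-breaking rule would return a different but equally valid parse tree.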