
SLIDE 1

Lexical and Syntactic Analysis — an example

Example: We would like to recognize a language of arithmetic expressions containing expressions such as:

    34
    x+1
    x * 2 + 128 * (y - z / 3)

- The expressions can contain number constants: sequences of digits 0, 1, ..., 9.
- The expressions can contain names of variables: sequences consisting of letters, digits, and the symbol “_”, which do not start with a digit.
- The expressions can contain basic arithmetic operations: “+”, “-”, “*”, “/”, and unary “-”.
- It is possible to use parentheses, “(” and “)”, and the standard priority of arithmetic operations applies.

Z. Sawa (TU Ostrava)

Theoretical Computer Science November 26, 2020 1 / 54

SLIDE 2

Lexical and Syntactic Analysis — an example

The problem we want to solve:

Input: a sequence of characters (e.g., a string, a text file, etc.)
Output: an abstract syntax tree representing the structure of the given expression, or information about a syntax error in the expression

SLIDE 3

Lexical and Syntactic Analysis — an example

It is convenient to decompose this problem into several parts:

- Lexical analysis: recognizing lexical elements (so-called tokens) such as identifiers, number constants, operators, etc.
- Syntactic analysis: determining whether a given sequence of tokens corresponds to an allowed structure of expressions. Basically, this means finding a derivation (or a derivation tree) for a given word in a context-free grammar representing the given language (in our case, the language of all well-formed expressions).
- Construction of an abstract syntax tree: this phase is usually combined with syntactic analysis. What the program actually produces is typically not a derivation tree itself but rather some kind of abstract syntax tree, or it performs actions associated with the rules of the grammar.

SLIDE 4

Lexical and Syntactic Analysis — an example

Terminals for the grammar representing well-formed expressions:

ident: identifier, e.g. “x”, “q3”, “count_r12”
num: number constant, e.g. “5”, “42”, “65535”
“(”: left parenthesis
“)”: right parenthesis
“+”: plus
“-”: minus
“*”: star
“/”: slash

Remark: Recognizing the sequences of symbols that correspond to individual terminals is the goal of lexical analysis.

SLIDE 5

Lexical and Syntactic Analysis — an example

Example: The expression -x * 2 + 128 * (y - z / 3) is represented by the following sequence of symbols:

    - x * 2 + 1 2 8 * ( y - z / 3 )

The following sequence of tokens corresponds to this sequence of symbols; these tokens are terminal symbols of the given context-free grammar:

    - ident * num + num * ( ident - ident / num )

SLIDE 7

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language (the first try):

E → ident | num | ( E ) | - E | E + E | E - E | E * E | E / E

This grammar is ambiguous: for example, ident + ident * ident has two different derivation trees.

SLIDE 8

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language (the second try):

E → T | T + E | T - E
T → F | F * T | F / T
F → ident | num | ( E ) | - F

Different levels of priority are represented by different nonterminals:

E: expression
T: term
F: factor

This grammar is unambiguous.

SLIDE 9

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language (the third try):

E → T | T A E
A → + | -
T → F | F M T
M → * | /
F → ident | num | ( E ) | - F

We create separate nonterminals for operators on different levels of priority:

A: additive operator
M: multiplicative operator

SLIDE 10

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language (the fourth try):

S → E eof
E → T | T A E
A → + | -
T → F | F M T
M → * | /
F → ident | num | ( E ) | - F

It is useful to introduce a special terminal eof representing the end of input. Moreover, in this grammar the initial nonterminal S does not occur on the right-hand side of any rule.

SLIDE 11

Implementation of Lexical Analysis

Enumerated type Token_kind representing different kinds of tokens:

T_EOF: the end of input
T_Ident: identifier
T_Number: number constant
T_LParen: “(”
T_RParen: “)”
T_Plus: “+”
T_Minus: “-”
T_Star: “*”
T_Slash: “/”

SLIDE 12

Implementation of Lexical Analysis

Variable c: the currently processed character (or a special value eof representing the end of input):

- at the beginning, the first character of the input is read into variable c
- function next-char() returns the next character from the input

Some helper functions:

- error(): outputs information about a syntax error and aborts the processing of the expression
- is-ident-start-char(c): tests whether c is a character that can occur at the beginning of an identifier
- is-ident-normal-char(c): tests whether c is a character that can occur in an identifier (at positions other than the beginning)
- is-digit(c): tests whether c is a digit

SLIDE 13

Implementation of Lexical Analysis

Some other helper functions:

- create-ident(s): creates an identifier from a given string s
- create-number(s): creates a number from a given string s

Auxiliary variables:

- last-ident: the last processed identifier
- last-num: the last processed number constant

Function next-token() is the main part of the lexical analyser; it returns the next token from the input.

SLIDE 14

Implementation of Lexical Analysis

next-token():
    while c ∈ {“ ”, “\t”} do
        c := next-char()
    if c == eof then
        return T_EOF
    else
        switch c do
            case “(”: do c := next-char(); return T_LParen
            case “)”: do c := next-char(); return T_RParen
            case “+”: do c := next-char(); return T_Plus
            case “-”: do c := next-char(); return T_Minus
            case “*”: do c := next-char(); return T_Star
            case “/”: do c := next-char(); return T_Slash
            otherwise do
                if is-ident-start-char(c) then return scan-ident()
                else if is-digit(c) then return scan-number()
                else error()

SLIDE 15

Implementation of Lexical Analysis

scan-ident():
    s := c
    c := next-char()
    while is-ident-normal-char(c) do
        s := s · c
        c := next-char()
    last-ident := create-ident(s)
    return T_Ident

SLIDE 16

Implementation of Lexical Analysis

scan-number():
    s := c
    c := next-char()
    while is-digit(c) do
        s := s · c
        c := next-char()
    last-num := create-number(s)
    return T_Number
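The lexical analyser described on the preceding slides can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the names TokenKind and tokenize are not from the slides, the whole input is held in one string instead of being read character by character, and the tokens are returned as a list of (kind, value) pairs rather than through the variables last-ident and last-num.

```python
import string
from enum import Enum, auto


class TokenKind(Enum):
    EOF = auto()
    IDENT = auto()
    NUMBER = auto()
    LPAREN = auto()
    RPAREN = auto()
    PLUS = auto()
    MINUS = auto()
    STAR = auto()
    SLASH = auto()


# Tokens that consist of exactly one character.
SINGLE_CHAR = {
    "(": TokenKind.LPAREN, ")": TokenKind.RPAREN,
    "+": TokenKind.PLUS, "-": TokenKind.MINUS,
    "*": TokenKind.STAR, "/": TokenKind.SLASH,
}
IDENT_START = string.ascii_letters + "_"
IDENT_CHARS = IDENT_START + string.digits


def tokenize(text):
    """Return a list of (kind, value) pairs ending with (EOF, None)."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \t":                      # skip blanks, as next-token() does
            i += 1
        elif c in SINGLE_CHAR:
            tokens.append((SINGLE_CHAR[c], None))
            i += 1
        elif c in IDENT_START:              # corresponds to scan-ident()
            j = i + 1
            while j < len(text) and text[j] in IDENT_CHARS:
                j += 1
            tokens.append((TokenKind.IDENT, text[i:j]))
            i = j
        elif c in string.digits:            # corresponds to scan-number()
            j = i + 1
            while j < len(text) and text[j] in string.digits:
                j += 1
            tokens.append((TokenKind.NUMBER, int(text[i:j])))
            i = j
        else:                               # corresponds to error()
            raise SyntaxError(f"unexpected character {c!r} at position {i}")
    tokens.append((TokenKind.EOF, None))
    return tokens
```

For instance, tokenize("x+1") yields an IDENT token with value "x", a PLUS token, a NUMBER token with value 1, and the final EOF token.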

SLIDE 17

Implementation of Syntactic Analysis

Variable t: the last processed token

A helper function: init-scanner():

- initializes the lexical analyser
- reads the first character from the input into variable c to ensure that this character is available in the following calls of function next-token()

Reading the next token: next-token():

- this is the previously described main function of the lexical analyser
- by repeatedly calling this function we read the tokens
- variable c always contains the symbol that has been read last

SLIDE 18

Implementation of Syntactic Analysis

One of the often used methods of syntactic analysis is recursive descent:

- For each nonterminal there is a corresponding function: the function corresponding to nonterminal A implements all rules with nonterminal A on the left-hand side.
- In a given function, the next token is used to select between the corresponding rules.
- Instructions in the body of a function correspond to processing of the right-hand sides of the rules:
  - an occurrence of nonterminal B: the function corresponding to nonterminal B is called
  - an occurrence of terminal a: it is checked that the following token corresponds to terminal a; if it does, the next token is read, otherwise an error is reported

SLIDE 19

Implementation of Syntactic Analysis

The previously described grammar is not very suitable for recursive descent because, for nonterminals E and T, it is not possible to choose deterministically between the two rules by looking at just one following symbol:

S → E eof
E → T | T A E
A → + | -
T → F | F M T
M → * | /
F → ident | num | ( E ) | - F

For example, if we want to rewrite nonterminal T and we know that the following terminal in the input is num, this terminal can be generated by use of either of the rules

T → F
T → F M T

SLIDE 20

Implementation of Syntactic Analysis

The following modified grammar does not have this problem:

S → E eof
E → T G
G → ε | A T G
A → + | -
T → F U
U → ε | M F U
M → * | /
F → - F | ( E ) | ident | num

SLIDE 21

Implementation of Syntactic Analysis

Parse():
    init-scanner()
    t := next-token()
    Parse-S()

S → E eof

Parse-S():
    Parse-E()
    if t ≠ T_EOF then error()

SLIDE 22

Implementation of Syntactic Analysis

E → T G

Parse-E():
    Parse-T()
    Parse-G()

G → ε | A T G

Parse-G():
    if t ∈ {T_Plus, T_Minus} then
        Parse-A()
        Parse-T()
        Parse-G()

SLIDE 23

Implementation of Syntactic Analysis

T → F U

Parse-T():
    Parse-F()
    Parse-U()

U → ε | M F U

Parse-U():
    if t ∈ {T_Star, T_Slash} then
        Parse-M()
        Parse-F()
        Parse-U()

SLIDE 24

Implementation of Syntactic Analysis

A → + | -

Parse-A():
    switch t do
        case T_Plus do t := next-token()
        case T_Minus do t := next-token()
        otherwise do error()

SLIDE 25

Implementation of Syntactic Analysis

M → * | /

Parse-M():
    switch t do
        case T_Star do t := next-token()
        case T_Slash do t := next-token()
        otherwise do error()

SLIDE 26

Implementation of Syntactic Analysis

F → ident | num | ( E ) | - F

Parse-F():
    switch t do
        case T_Ident do t := next-token()
        case T_Number do t := next-token()
        case T_LParen do
            t := next-token()
            Parse-E()
            if t ≠ T_RParen then error()
            t := next-token()
        case T_Minus do
            t := next-token()
            Parse-F()
        otherwise do error()

SLIDE 27

Implementation of Syntactic Analysis

If a function ends with a recursive call of itself, as for example function Parse-G(), it is possible to replace this recursion with an iteration. Functions Parse-E() and Parse-G() can be merged into one function. Similarly, it is possible to replace a recursion with an iteration in function Parse-U(), and functions Parse-T() and Parse-U() can be merged into one function.


SLIDE 28

E → T G
G → ε | A T G

Parse-E():
    Parse-T()
    while t ∈ {T_Plus, T_Minus} do
        Parse-A()
        Parse-T()

T → F U
U → ε | M F U

Parse-T():
    Parse-F()
    while t ∈ {T_Star, T_Slash} do
        Parse-M()
        Parse-F()
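The iterative recursive-descent functions can be tried out with a small Python sketch such as the one below. It is an illustration only: the names tokenize, Parser and accepts are assumptions of this sketch, the token kinds are plain strings ("num", "ident", the operator or parenthesis character itself, and "eof"), the functions Parse-A() and Parse-M() are inlined as a single call of advance(), and the lexical analysis is done with a regular expression instead of the character-by-character scanner.

```python
import re

# One token per match: a number, an identifier, or any other single character.
TOKEN_RE = re.compile(r"\s*(?:([0-9]+)|([A-Za-z_][A-Za-z0-9_]*)|(.))")


def tokenize(text):
    """Return token kinds: "num", "ident", or the character itself, then "eof"."""
    kinds = []
    for num, ident, other in TOKEN_RE.findall(text.strip()):
        kinds.append("num" if num else "ident" if ident else other)
    kinds.append("eof")
    return kinds


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def t(self):
        """The current token (variable t on the slides)."""
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def parse(self):                 # S -> E eof
        self.parse_E()
        if self.t != "eof":
            raise SyntaxError("unexpected token after expression")

    def parse_E(self):               # E -> T G,  G -> eps | A T G
        self.parse_T()
        while self.t in ("+", "-"):
            self.advance()
            self.parse_T()

    def parse_T(self):               # T -> F U,  U -> eps | M F U
        self.parse_F()
        while self.t in ("*", "/"):
            self.advance()
            self.parse_F()

    def parse_F(self):               # F -> ident | num | ( E ) | - F
        if self.t in ("ident", "num"):
            self.advance()
        elif self.t == "(":
            self.advance()
            self.parse_E()
            if self.t != ")":
                raise SyntaxError("expected ')'")
            self.advance()
        elif self.t == "-":
            self.advance()
            self.parse_F()
        else:
            raise SyntaxError(f"unexpected token {self.t!r}")


def accepts(text):
    """True if the text is a well-formed expression."""
    try:
        Parser(text).parse()
        return True
    except SyntaxError:
        return False
```

With this sketch, accepts("-x * 2 + 128 * (y - z / 3)") is True, while accepts("x +") and accepts("(x") are False.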

SLIDE 29

Implementation of Syntactic Analysis

The implementation described above just finds out whether the given input corresponds to some word that can be generated by the given grammar.

- If this is the case, it reads the whole input and finishes successfully.
- If it is not the case, function error() is called.

In a real implementation, it is useful to provide function error() with a description of the kind of error together with information about the position in the input where the error occurred (e.g., the line and column where the currently processed token starts). Function error() can use this information to create error messages that are displayed to a user.

SLIDE 30

Implementation of Syntactic Analysis

Typically, we do not want to use syntactic analysis just to check that the input is correct but also to create an abstract syntax tree or to perform some other types of actions connected with individual rules of the grammar.

The previously presented code can be used as a base that can be extended with other actions such as construction of an abstract syntax tree, modifications of the expressions that have been read, and possibly some other types of computation.

When the functions that correspond to nonterminals should create the corresponding abstract syntax tree, they can return the constructed subtree, corresponding to the part of the expression generated from the given nonterminal, as a return value.

SLIDE 31

Implementation of Syntactic Analysis

Construction of an abstract syntax tree:

An enumerated type representing binary arithmetic operations:

    enum Bin_op { Add, Sub, Mul, Div }

An enumerated type representing unary arithmetic operations:

    enum Un_op { Un_minus }

Functions for creation of different kinds of nodes of an abstract syntax tree:

- mk-var(ident): creates a leaf representing a variable
- mk-num(num): creates a leaf representing a number constant
- mk-unary(op, e): creates a node with one child e, on which a unary operation op (of type Un_op) is applied
- mk-binary(op, e1, e2): creates a node with two children e1 and e2, on which a binary operation op (of type Bin_op) is applied

SLIDE 32

Implementation of Syntactic Analysis

S → E eof

Parse():
    init-scanner()
    t := next-token()
    e := Parse-E()
    if t ≠ T_EOF then error()
    return e

SLIDE 33

Implementation of Syntactic Analysis

E → T G
G → ε | A T G

Parse-E():
    e1 := Parse-T()
    while t ∈ {T_Plus, T_Minus} do
        op := Parse-A()
        e2 := Parse-T()
        e1 := mk-binary(op, e1, e2)
    return e1

SLIDE 34

Implementation of Syntactic Analysis

A → + | -

Parse-A():
    switch t do
        case T_Plus do
            t := next-token()
            return Add
        case T_Minus do
            t := next-token()
            return Sub
        otherwise do error()

SLIDE 35

Implementation of Syntactic Analysis

T → F U
U → ε | M F U

Parse-T():
    e1 := Parse-F()
    while t ∈ {T_Star, T_Slash} do
        op := Parse-M()
        e2 := Parse-F()
        e1 := mk-binary(op, e1, e2)
    return e1

SLIDE 36

Implementation of Syntactic Analysis

M → * | /

Parse-M():
    switch t do
        case T_Star do
            t := next-token()
            return Mul
        case T_Slash do
            t := next-token()
            return Div
        otherwise do error()

SLIDE 37

F → ident | num | ( E ) | - F

Parse-F():
    switch t do
        case T_Ident do
            e := mk-var(last-ident)
            t := next-token()
            return e
        case T_Number do
            e := mk-num(last-num)
            t := next-token()
            return e
        case T_LParen do
            t := next-token()
            e := Parse-E()
            if t ≠ T_RParen then error()
            t := next-token()
            return e
        case T_Minus do
            t := next-token()
            e := Parse-F()
            return mk-unary(Un_minus, e)
        otherwise do error()
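The tree-building version of the parser can be sketched in Python as follows. The representation is an assumption of this sketch, not something given on the slides: tree nodes are plain tuples, ("var", name) and ("num", value) for leaves, ("neg", e) for unary minus, and (op, e1, e2) with op one of "+", "-", "*", "/" for binary operations. The names tokenize and parse are also hypothetical, and Parse-A() and Parse-M() are inlined.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:([0-9]+)|([A-Za-z_][A-Za-z0-9_]*)|(.))")


def tokenize(text):
    """Return (kind, value) pairs; kind is "num", "ident", a single character, or "eof"."""
    tokens = []
    for num, ident, other in TOKEN_RE.findall(text.strip()):
        if num:
            tokens.append(("num", int(num)))
        elif ident:
            tokens.append(("ident", ident))
        else:
            tokens.append((other, None))
    tokens.append(("eof", None))
    return tokens


def parse(text):
    """Return the abstract syntax tree of the expression, or raise SyntaxError."""
    tokens = tokenize(text)
    pos = 0

    def kind():
        return tokens[pos][0]

    def advance():
        """Move to the next token and return the value of the consumed one."""
        nonlocal pos
        value = tokens[pos][1]
        pos += 1
        return value

    def parse_E():                       # E -> T G,  G -> eps | A T G
        e1 = parse_T()
        while kind() in ("+", "-"):
            op = kind()
            advance()
            e1 = (op, e1, parse_T())     # grows to the left: left associativity
        return e1

    def parse_T():                       # T -> F U,  U -> eps | M F U
        e1 = parse_F()
        while kind() in ("*", "/"):
            op = kind()
            advance()
            e1 = (op, e1, parse_F())
        return e1

    def parse_F():                       # F -> ident | num | ( E ) | - F
        k = kind()
        if k == "ident":
            return ("var", advance())
        if k == "num":
            return ("num", advance())
        if k == "(":
            advance()
            e = parse_E()
            if kind() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return e
        if k == "-":
            advance()
            return ("neg", parse_F())
        raise SyntaxError(f"unexpected token {k!r}")

    e = parse_E()
    if kind() != "eof":
        raise SyntaxError("unexpected token after expression")
    return e
```

Because the loop in parse_E always wraps the tree built so far as the left child, parse("a - b - c") produces the tree for (a - b) - c, which is the usual meaning of subtraction.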

SLIDE 38

Reduction of a Context-Free Grammar

Definition

A context-free grammar G = (Π, Σ, S, P) is reduced if for every A ∈ Π:

- there are some u, v ∈ Σ∗ such that S ⇒∗ uAv, and
- there is some w ∈ Σ∗ such that A ⇒∗ w.

Remark: Obviously, if S ⇒∗ uAv and A ⇒∗ w where u, v, w ∈ Σ∗, then S ⇒∗ uwv, and so A is used in some derivation of a word from Σ∗. On the other hand, if A is used in some derivation S ⇒∗ z of a word z ∈ Σ∗, then z can be divided into parts u, w, v such that z = uwv, S ⇒∗ uAv, and A ⇒∗ w.

SLIDE 39

Reduction of a Context-Free Grammar

Obviously, every A ∈ Π with the property that

- there are no u, v ∈ Σ∗ such that S ⇒∗ uAv, or
- there is no w ∈ Σ∗ such that A ⇒∗ w,

can be safely removed from the grammar (together with all rules where it occurs) without affecting the generated language.

SLIDE 40

Reduction of a Context-Free Grammar

An algorithm that for a given CFG G constructs an equivalent reduced grammar:

1. Construct the set T of all nonterminals that can generate a terminal word: T = { A ∈ Π | (∃w ∈ Σ∗)(A ⇒∗ w) }
2. Remove from G all nonterminals from the set Π − T together with all rules where they occur. Denote the resulting grammar G′ = (Π′, Σ, S, P′).
3. Construct the set D of all nonterminals that can be “reached” from the initial nonterminal S: D = { A ∈ Π′ | (∃α, β ∈ (Π′ ∪ Σ)∗)(S ⇒∗ αAβ) }
4. Remove from G′ all nonterminals from the set Π′ − D together with all rules where they occur. The resulting grammar G′′ is the result of the whole algorithm.

SLIDE 50

Reduction of a Context-Free Grammar

Example:

S → AC | B
A → aC | AbA
B → Ba | BbA | DB
C → aa | aBC
D → aA | ε

T0 = {C, D}
T1 = {C, D, A}
T2 = {C, D, A, S}
T = {C, D, A, S}

S → AC
A → aC | AbA
C → aa
D → aA | ε

D0 = {S}
D1 = {S, A, C}
D = {S, A, C}

S → AC
A → aC | AbA
C → aa
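The four steps of the reduction algorithm can be sketched in Python as below. The encoding is an assumption made for this sketch only: a grammar is a dictionary mapping each nonterminal to a list of right-hand sides, every symbol is a single character, uppercase letters are nonterminals, everything else is a terminal, and the empty string stands for ε. The function name reduce_grammar is hypothetical.

```python
def reduce_grammar(rules, start):
    """Return an equivalent reduced grammar (steps 1 to 4 of the algorithm)."""

    def only_uses(rhs, allowed):
        # True if every nonterminal occurring in rhs belongs to `allowed`.
        return all(not s.isupper() or s in allowed for s in rhs)

    # Step 1: the set T of nonterminals that generate some terminal word,
    # computed by iterating until nothing new is added.
    generating = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in generating and any(only_uses(r, generating) for r in rhss):
                generating.add(a)
                changed = True

    # Step 2: drop the other nonterminals and all rules where they occur.
    rules1 = {a: [r for r in rules[a] if only_uses(r, generating)]
              for a in generating}

    # Step 3: the set D of nonterminals reachable from the start symbol.
    reachable = {start}
    stack = [start]
    while stack:
        a = stack.pop()
        for rhs in rules1.get(a, []):
            for s in rhs:
                if s.isupper() and s not in reachable:
                    reachable.add(s)
                    stack.append(s)

    # Step 4: keep only reachable nonterminals.
    return {a: rhss for a, rhss in rules1.items() if a in reachable}


example = {
    "S": ["AC", "B"],
    "A": ["aC", "AbA"],
    "B": ["Ba", "BbA", "DB"],
    "C": ["aa", "aBC"],
    "D": ["aA", ""],
}
reduced = reduce_grammar(example, "S")
```

On the grammar from the example above, reduced contains exactly the rules S → AC, A → aC | AbA and C → aa. If the start symbol generates no terminal word, the sketch returns an empty dictionary.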

SLIDE 51

Some Properties of Context-free Grammars

Let us assume we have a context-free grammar G = (Π, Σ, S, P).

We can easily construct algorithms for the following problems dealing with some properties of context-free grammar G:

- To find out for given α ∈ (Π ∪ Σ)∗ whether α ⇒∗ ε.
- To find, for given α ∈ (Π ∪ Σ)∗, the set first(α), where
  first(α) = { a ∈ Σ | α ⇒∗ aβ for some β ∈ (Π ∪ Σ)∗ }
- To find, for given α ∈ (Π ∪ Σ)∗, the set last(α), where
  last(α) = { a ∈ Σ | α ⇒∗ βa for some β ∈ (Π ∪ Σ)∗ }

SLIDE 52

Some Properties of Context-free Grammars

- To find, for given nonterminal A ∈ Π, the set follow(A), where
  follow(A) = { a ∈ Σ | S ⇒∗ β1 A a β2 for some β1, β2 ∈ (Π ∪ Σ)∗ }
- To find all nonterminals A ∈ Π for which grammar G contains left recursion, i.e., those for which A ⇒+ Aα for some α ∈ (Π ∪ Σ)∗
- To find all nonterminals A ∈ Π for which grammar G contains right recursion, i.e., those for which A ⇒+ αA for some α ∈ (Π ∪ Σ)∗

Remark: The notation α ⇒+ β, where α, β ∈ (Π ∪ Σ)∗, denotes that α can be rewritten to β (i.e., α ⇒∗ β) by a derivation with a nonzero number of steps.

SLIDE 53

Some Properties of Context-free Grammars

To be able to use a given context-free grammar G for a straightforward implementation of recursive descent, it must have some particular properties:

- It must not contain left recursion.
- For each nonterminal A ∈ Π and all rules with A on the left-hand side, i.e., A → α1 | α2 | · · · | αn, the sets first(α1), first(α2), ..., first(αn) must be pairwise disjoint.
- For every nonterminal A ∈ Π and all rules A → α1 | α2 | · · · | αn, there can be at most one right-hand side αi such that αi ⇒∗ ε. If there is such a right-hand side (and so A ⇒∗ ε), the sets first(α1), first(α2), ..., first(αn) must be disjoint from the set follow(A).
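The sets first and follow, and the test of the last two conditions above, can be computed by a simple fixed-point iteration. The following Python sketch is one possible way to do it and makes several assumptions of its own: grammars are written as text lines of the form "G -> | A T G" (an empty alternative stands for ε), the nonterminals are exactly the left-hand sides, and the function names parse_rules, analyse, first_of and suitable_for_recursive_descent are hypothetical. Left recursion is not tested separately in this sketch.

```python
def parse_rules(spec):
    """Turn lines like 'G -> | A T G' into {lhs: [list of symbols, ...]}."""
    rules = {}
    for line in spec.strip().splitlines():
        lhs, rhs = line.split("->", 1)
        rules[lhs.strip()] = [alt.split() for alt in rhs.split("|")]
    return rules


def analyse(rules):
    """Return (nullable, first, follow), computed by fixed-point iteration."""
    nullable = set()
    first = {a: set() for a in rules}
    follow = {a: set() for a in rules}
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            for rhs in rhss:
                if a not in nullable and all(s in nullable for s in rhs):
                    nullable.add(a)
                    changed = True
                for i, s in enumerate(rhs):
                    # Terminals that can start the rest of the rule after s.
                    after, rest_nullable = first_of(rhs[i + 1:], nullable, first)
                    if rest_nullable:
                        after = after | follow[a]
                    if s in rules and not after <= follow[s]:
                        follow[s] |= after
                        changed = True
                head, _ = first_of(rhs, nullable, first)
                if not head <= first[a]:
                    first[a] |= head
                    changed = True
    return nullable, first, follow


def first_of(alpha, nullable, first):
    """Return (first(alpha), whether alpha can be rewritten to epsilon)."""
    result = set()
    for s in alpha:
        result |= first.get(s, {s})      # a terminal s contributes itself
        if s not in nullable:
            return result, False
    return result, True


def suitable_for_recursive_descent(rules):
    nullable, first, follow = analyse(rules)
    for a, rhss in rules.items():
        seen, eps_count = set(), 0
        for rhs in rhss:
            f, eps = first_of(rhs, nullable, first)
            if f & seen:                 # first sets are not pairwise disjoint
                return False
            seen |= f
            eps_count += eps
        if eps_count > 1 or (eps_count == 1 and seen & follow[a]):
            return False
    return True


original = parse_rules("""
S -> E eof
E -> T | T A E
A -> + | -
T -> F | F M T
M -> * | /
F -> ident | num | ( E ) | - F
""")
modified = parse_rules("""
S -> E eof
E -> T G
G -> | A T G
A -> + | -
T -> F U
U -> | M F U
M -> * | /
F -> - F | ( E ) | ident | num
""")
```

Under these assumptions, the check fails for the grammar with the rules E → T | T A E (both alternatives start with the same terminals) and succeeds for the modified grammar with the nonterminals G and U.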

SLIDE 54

Removing Epsilon-rules

Rules of the form A → ε are called epsilon-rules (ε-rules).

Proposition

For every context-free grammar G there is a context-free grammar G′ without ε-rules such that L(G′) = L(G) − {ε}.

Proof: Construct the set E of all nonterminals that can be rewritten to ε, i.e.,

E = { A ∈ Π | A ⇒∗ ε }

Remove all ε-rules and replace every other rule A → α with the set of all rules of the form A → α′, where α′ ≠ ε is obtained from α by possibly omitting (some) occurrences of nonterminals from E.

SLIDE 60

Removing Epsilon-rules

Example:

S → ASA | aBC | b
A → BD | aAB
B → bB | ε
C → AaA | b
D → AD | BBB | a

E0 = {B}
E1 = {B, D}
E2 = {B, D, A}
E = {B, D, A}

S → ASA | SA | AS | S | aBC | aC | b
A → BD | B | D | aAB | aB | aA | a
B → bB | b
C → AaA | aA | Aa | a | b
D → AD | D | A | BBB | BB | B | a
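The construction from the proof can be sketched in Python as follows. As an assumption of this sketch, a grammar is a dictionary from nonterminals to lists of right-hand sides, each symbol is one character, and the empty string stands for ε; the function name remove_epsilon_rules is hypothetical.

```python
from itertools import product


def remove_epsilon_rules(rules):
    """Return a grammar without epsilon-rules generating L(G) - {epsilon}."""
    # The set E of nonterminals that can be rewritten to epsilon.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in nullable and any(all(s in nullable for s in r) for r in rhss):
                nullable.add(a)
                changed = True

    new_rules = {}
    for a, rhss in rules.items():
        result = []
        for rhs in rhss:
            # Every occurrence of a nullable nonterminal is either kept or omitted.
            options = [(s, "") if s in nullable else (s,) for s in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate and candidate not in result:   # skip epsilon, duplicates
                    result.append(candidate)
        new_rules[a] = result
    return new_rules


example = {
    "S": ["ASA", "aBC", "b"],
    "A": ["BD", "aAB"],
    "B": ["bB", ""],
    "C": ["AaA", "b"],
    "D": ["AD", "BBB", "a"],
}
without_eps = remove_epsilon_rules(example)
```

For the grammar from the example above, the sketch produces the same rules as shown there (possibly in a different order), including unit rules such as S → S and D → D, which can be dropped or handled by the removal of unit rules.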

SLIDE 61

Removing Epsilon-rules

For every context-free grammar G = (Π, Σ, S, P) there is a context-free grammar G′ = (Π′, Σ, S′, P′) such that L(G′) = L(G) and either:

- G′ does not contain ε-rules, or
- the only ε-rule in G′ is the rule S′ → ε and S′ does not occur on the right-hand side of any rule in G′.

SLIDE 62

Removing Unit-rules

Rules of the form A → B where A, B ∈ Π are called unit rules.

Proposition

For every context-free grammar G there is a context-free grammar G′ without ε-rules and without unit rules such that L(G′) = L(G) − {ε}.

Proof: Assume G = (Π, Σ, S, P) does not contain ε-rules.

For each A ∈ Π compute the set N_A of all nonterminals that can be obtained from A by using only unit rules, i.e.,

N_A = { B ∈ Π | A ⇒∗ B }

Construct the CFG G′ = (Π, Σ, S, P′) where P′ consists of rules of the form A → β where A ∈ Π, β is not a single nonterminal, and (B → β) ∈ P for some B ∈ N_A.

SLIDE 79

Removing Unit-rules

Example:

S → AB | C
A → a | bA
B → C | b
C → D | AA | AaA
D → B | ABb

N_S^0 = {S}, N_S^1 = {S, C}, N_S^2 = {S, C, D}, N_S^3 = {S, C, D, B}
N_A^0 = {A}
N_B^0 = {B}, N_B^1 = {B, C}, N_B^2 = {B, C, D}
N_C^0 = {C}, N_C^1 = {C, D}, N_C^2 = {C, D, B}
N_D^0 = {D}, N_D^1 = {D, B}, N_D^2 = {D, B, C}

N_S = {S, C, D, B}
N_A = {A}
N_B = {B, C, D}
N_C = {C, D, B}
N_D = {D, B, C}

S → AB | AA | AaA | ABb | b
A → a | bA
B → b | AA | AaA | ABb
C → AA | AaA | ABb | b
D → ABb | b | AA | AaA
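The construction of the sets N_A and of the new rules can be sketched in Python as below. The encoding is an assumption of this sketch: a grammar without ε-rules is a dictionary from nonterminals to lists of right-hand sides, every symbol is one character, and a right-hand side is a unit rule exactly when it is a single character that is itself a key of the dictionary. The function name remove_unit_rules is hypothetical.

```python
def remove_unit_rules(rules):
    """Return an equivalent grammar without unit rules (input has no epsilon-rules)."""

    def is_unit(rhs):
        return len(rhs) == 1 and rhs in rules

    new_rules = {}
    for a in rules:
        # N_A: nonterminals obtainable from A by using only unit rules.
        n_a = {a}
        stack = [a]
        while stack:
            b = stack.pop()
            for rhs in rules[b]:
                if is_unit(rhs) and rhs not in n_a:
                    n_a.add(rhs)
                    stack.append(rhs)

        # A -> beta for every non-unit rule B -> beta with B in N_A.
        result = []
        for b in sorted(n_a):
            for rhs in rules[b]:
                if not is_unit(rhs) and rhs not in result:
                    result.append(rhs)
        new_rules[a] = result
    return new_rules


example = {
    "S": ["AB", "C"],
    "A": ["a", "bA"],
    "B": ["C", "b"],
    "C": ["D", "AA", "AaA"],
    "D": ["B", "ABb"],
}
without_units = remove_unit_rules(example)
```

On the example grammar this gives, up to the order of alternatives, the rules listed above; for instance S receives the right-hand sides AB, AA, AaA, ABb and b.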

SLIDE 80

Chomsky Normal Form

Definition

A context-free grammar is in Chomsky normal form if every rule is of one of the following forms:

A → BC
A → a

where a is any terminal and A, B, and C are any nonterminals.

In addition we permit the rule S → ε, where S is the initial nonterminal. In that case, nonterminal S cannot occur on the right-hand side of any rule.

SLIDE 81

Chomsky Normal Form

Proposition

For every context-free grammar G there is an equivalent context-free grammar G′ in Chomsky normal form.

Proof: Perform the following transformations on G:

1. Decompose each rule A → α where |α| ≥ 3 into a sequence of rules where each right-hand side has length 2.
2. Remove ε-rules.
3. Remove unit rules.
4. For each terminal a occurring on the right-hand side of some rule A → α where |α| = 2, introduce a new nonterminal N_a, replace occurrences of a on such right-hand sides with N_a, and add N_a → a as a new rule.

SLIDE 88

Chomsky Normal Form

Example:

S → ASA | aB
A → B | S
B → b | ε

Step 1:

S → AZ | aB
Z → SA
A → B | S
B → b | ε

Step 2: E = {B, A}

S0 → S
S → AZ | Z | aB | a
Z → SA | S
A → B | S
B → b

Step 3: N_S0 = {S0, S, Z}, N_S = {S, Z}, N_Z = {Z, S}, N_A = {A, B, S, Z}, N_B = {B}

S0 → AZ | aB | a | SA
S → AZ | aB | a | SA
Z → SA | AZ | aB | a
A → b | AZ | aB | a | SA
B → b

Step 4:

S0 → AZ | YB | a | SA
S → AZ | YB | a | SA
Z → SA | AZ | YB | a
A → b | AZ | YB | a | SA
B → b
Y → a
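As a sanity check, the result of Step 4 can be tested mechanically. The sketch below assumes that a grammar is in Chomsky normal form when every rule has the form A → BC or A → a; the dictionary layout (symbols separated by spaces) and the name `is_cnf` are illustrative assumptions, not part of the slides.

```python
# Grammar after Step 4; symbols on each right-hand side are space-separated.
AFTER_STEP_4 = {
    "S0": ["A Z", "Y B", "a", "S A"],
    "S": ["A Z", "Y B", "a", "S A"],
    "Z": ["S A", "A Z", "Y B", "a"],
    "A": ["b", "A Z", "Y B", "a", "S A"],
    "B": ["b"],
    "Y": ["a"],
}

# Grammar after Step 3, which still contains the mixed right-hand side "a B".
AFTER_STEP_3 = {
    "S0": ["A Z", "a B", "a", "S A"],
    "S": ["A Z", "a B", "a", "S A"],
    "Z": ["S A", "A Z", "a B", "a"],
    "A": ["b", "A Z", "a B", "a", "S A"],
    "B": ["b"],
}


def is_cnf(grammar):
    """Return True if every rule is A -> BC (two nonterminals) or A -> a."""
    nonterminals = set(grammar)
    for alternatives in grammar.values():
        for alternative in alternatives:
            symbols = alternative.split()
            if len(symbols) == 1 and symbols[0] not in nonterminals:
                continue  # rule of the form A -> a
            if len(symbols) == 2 and all(s in nonterminals for s in symbols):
                continue  # rule of the form A -> BC
            return False
    return True
```

Under these assumptions `is_cnf(AFTER_STEP_4)` holds, while `is_cnf(AFTER_STEP_3)` fails because of the rule S → aB.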


SLIDE 89

Chomsky Normal Form

A grammar G = (Π, Σ, S, P) in Chomsky normal form has properties that allow us to determine whether a word w ∈ Σ∗ belongs to the language generated by G (i.e., whether w ∈ L(G)).

Let us assume that w ∈ L(G) (and so S ⇒∗ w) and that |w| = n, where n ≥ 1. Then the following holds for every derivation S ⇒∗ w:

Rules of the form A → a (i.e., a nonterminal is rewritten to exactly one terminal) are used in exactly n steps of the derivation, because each such step produces one of the n terminals of w.

Rules of the form A → BC (i.e., a nonterminal is rewritten to a pair of nonterminals) are used in exactly n − 1 steps of the derivation, because each such step lengthens the sentential form by one symbol, from length 1 to length n.

So every derivation S ⇒∗ w, where |w| = n, has exactly 2n − 1 steps: n steps use rules of the form A → a and n − 1 steps use rules of the form A → BC.
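As a small worked illustration of this count, consider the Chomsky normal form grammar obtained in the example above (start symbol S0) and the word w = bab, so n = 3. One possible leftmost derivation is:

```latex
\[
S_0 \Rightarrow AZ \Rightarrow bZ \Rightarrow bYB \Rightarrow baB \Rightarrow bab
\]
% Rules of the form A -> BC:  S_0 -> AZ,  Z -> YB           (n - 1 = 2 steps)
% Rules of the form A -> a:   A -> b,  Y -> a,  B -> b      (n = 3 steps)
% Total: 2n - 1 = 5 steps
```

The derivation has five steps, which agrees with 2n − 1 for n = 3.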


SLIDE 90

Chomsky Normal Form

To find out whether S ⇒∗ w, it is sufficient to try by brute force all possible derivations of length 2n − 1. Such an algorithm has exponential time complexity with respect to the length of w.

This systematic trying of all possibilities can be implemented using so-called dynamic programming, in a way that is much more efficient than a straightforward algorithm that generates all derivations of the given length. The Cocke-Younger-Kasami algorithm, with time complexity O(n^3) (assuming a fixed grammar G), is based on this idea.
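The brute-force approach can be sketched as follows. This is only an illustration: the grammar is the Chomsky normal form grammar from the earlier example, stored in a hypothetical dictionary `CNF_RULES` in which every symbol on a right-hand side is assumed to be a single character. The function `brute_force_member` is likewise an illustrative name. It explores leftmost derivations only, which is sufficient because every derivable word has a leftmost derivation of the same length, and it discards sentential forms longer than n, since in Chomsky normal form they never shrink.

```python
CNF_RULES = {
    "S0": ["AZ", "YB", "a", "SA"],
    "S": ["AZ", "YB", "a", "SA"],
    "Z": ["SA", "AZ", "YB", "a"],
    "A": ["b", "AZ", "YB", "a", "SA"],
    "B": ["b"],
    "Y": ["a"],
}


def brute_force_member(rules, start, word):
    """Try all leftmost derivations with at most 2n - 1 steps (exponential)."""
    n = len(word)
    target = tuple(word)

    def search(form, steps_left):
        # Position of the leftmost nonterminal, or None if none remains.
        pos = next((i for i, s in enumerate(form) if s in rules), None)
        if pos is None:
            return form == target  # only terminals remain
        if steps_left == 0 or len(form) > n:
            return False
        return any(
            search(form[:pos] + tuple(rhs) + form[pos + 1:], steps_left - 1)
            for rhs in rules[form[pos]]
        )

    return n >= 1 and search((start,), 2 * n - 1)
```

For instance, `brute_force_member(CNF_RULES, "S0", "bab")` succeeds, while the word bb is rejected because every word of this language contains the letter a.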

SLIDE 91

Cocke-Younger-Kasami Algorithm

The question whether S ⇒∗ w is a special case of the question whether A ⇒∗ w, where A ∈ Π is an arbitrary nonterminal and w ∈ Σ∗ is an arbitrary word consisting of terminals. For a grammar in Chomsky normal form, the following clearly holds:

If |w| = 1: A ⇒∗ w iff there is a rule A → b in P where w = b.

If |w| > 1: A ⇒∗ w iff there is a rule A → BC in P such that, for some words u and v with w = uv, |u| ≥ 1 and |v| ≥ 1, it holds that B ⇒∗ u and C ⇒∗ v.
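The two cases above translate almost literally into a recursive function. The sketch below is an illustration under the same assumptions as before: the example grammar in Chomsky normal form is stored in a hypothetical dictionary `CNF_RULES` with single-character symbols on right-hand sides, and `make_derives` is an illustrative name. Memoization via `functools.lru_cache` avoids recomputing the answer for the same pair (A, w).

```python
from functools import lru_cache

CNF_RULES = {
    "S0": ["AZ", "YB", "a", "SA"],
    "S": ["AZ", "YB", "a", "SA"],
    "Z": ["SA", "AZ", "YB", "a"],
    "A": ["b", "AZ", "YB", "a", "SA"],
    "B": ["b"],
    "Y": ["a"],
}


def make_derives(rules):
    """Build a memoized predicate derives(A, w) meaning A =>* w."""

    @lru_cache(maxsize=None)
    def derives(nonterminal, w):
        if len(w) == 1:
            # Case |w| = 1: there must be a rule A -> b with w = b.
            return w in rules[nonterminal]
        # Case |w| > 1: some rule A -> BC and some split w = uv must work.
        return any(
            len(rhs) == 2
            and derives(rhs[0], w[:k])
            and derives(rhs[1], w[k:])
            for rhs in rules[nonterminal]
            for k in range(1, len(w))
        )

    return derives
```

With `derives = make_derives(CNF_RULES)`, the call `derives("S0", "bab")` answers the original question whether S0 ⇒∗ bab.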


SLIDE 92

Cocke-Younger-Kasami Algorithm

Let us consider a word w ∈ Σ∗ with |w| = n, where n ≥ 1 and w = a_1 a_2 ··· a_n. Instead of solving only the original question whether S ⇒∗ w, we will solve the following more general problem for all nonempty subwords v of the word w:

To find the set of all nonterminals A ∈ Π such that A ⇒∗ v.

Let us denote by F[i][j] the set of all nonterminals generating the subword of length i starting at position j, i.e., for each A ∈ Π it holds that

A ∈ F[i][j] ⇔ A ⇒∗ a_j a_{j+1} ··· a_{j+i−1}

Finding out whether S ⇒∗ w is therefore the same problem as finding out whether S ∈ F[n][1].
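To make the indexing concrete, the following small sketch lists which subword each pair (i, j) refers to. The helper names `subword` and `index_pairs` are illustrative assumptions. Positions are 1-based as in the text, while Python strings are 0-based, hence the shift by one.

```python
def subword(w, i, j):
    """Return the subword of w of length i starting at 1-based position j."""
    return w[j - 1 : j - 1 + i]


def index_pairs(n):
    """All pairs (i, j) with 1 <= i <= n and 1 <= j <= n - i + 1."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n - i + 2)]


if __name__ == "__main__":
    w = "bab"
    for i, j in index_pairs(len(w)):
        print(f"F[{i}][{j}] describes the subword {subword(w, i, j)!r}")
```

For a word of length n there are n(n + 1)/2 such pairs, and the pair (n, 1) corresponds to the whole word.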


SLIDE 93

Cocke-Younger-Kasami Algorithm

The algorithm computes the values F[i][j] first for subwords of length 1 (i.e., i = 1), then for subwords of length 2 (i.e., i = 2), then for subwords of length 3, length 4, etc.

The values F[i][j] are stored in a two-dimensional array F, where 1 ≤ i ≤ n and 1 ≤ j ≤ n − i + 1, and where the elements of this array are subsets of the set of nonterminals Π.

The computation of the value F[i][j] uses the previously computed values F[i′][j′], where i′ < i.

Let us assume that at the beginning all elements of the array F are initialized to ∅.


SLIDE 94

Cocke-Younger-Kasami Algorithm

for j := 1 to n do
    for each (A → b) ∈ P do
        if b = a_j then
            add A to F[1][j]
for i := 2 to n do
    for j := 1 to n − i + 1 do
        for k := 1 to i − 1 do
            for each (A → BC) ∈ P do
                if B ∈ F[k][j] and C ∈ F[i − k][j + k] then
                    add A to F[i][j]
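The pseudocode above can be turned into a short Python function. This is a sketch under stated assumptions rather than a reference implementation: the grammar is the Chomsky normal form grammar from the earlier example, stored in a hypothetical dictionary `CNF_RULES` with single-character symbols on right-hand sides, and the name `cyk` is illustrative. To keep the 1-based indices of the pseudocode, the table is allocated one row and one column larger than needed and index 0 is left unused.

```python
CNF_RULES = {
    "S0": ["AZ", "YB", "a", "SA"],
    "S": ["AZ", "YB", "a", "SA"],
    "Z": ["SA", "AZ", "YB", "a"],
    "A": ["b", "AZ", "YB", "a", "SA"],
    "B": ["b"],
    "Y": ["a"],
}


def cyk(rules, start, word):
    """Decide whether start =>* word for a grammar in Chomsky normal form."""
    n = len(word)
    if n == 0:
        return False  # the algorithm assumes n >= 1

    # F[i][j] = set of nonterminals generating the subword of length i
    # starting at 1-based position j; all entries start as the empty set.
    F = [[set() for _ in range(n + 2)] for _ in range(n + 1)]

    # Subwords of length 1: rules of the form A -> b.
    for j in range(1, n + 1):
        for lhs, alternatives in rules.items():
            if word[j - 1] in alternatives:
                F[1][j].add(lhs)

    # Subwords of length i >= 2: rules of the form A -> BC, split after k symbols.
    for i in range(2, n + 1):
        for j in range(1, n - i + 2):
            for k in range(1, i):
                for lhs, alternatives in rules.items():
                    for rhs in alternatives:
                        if (
                            len(rhs) == 2
                            and rhs[0] in F[k][j]
                            and rhs[1] in F[i - k][j + k]
                        ):
                            F[i][j].add(lhs)

    return start in F[n][1]
```

The three nested loops over i, j, and k give the cubic dependence on n; the innermost loops over the rules contribute only a constant factor for a fixed grammar.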
