
CPSC 313: Chomsky Normal Form

October 25, 2020

Motivation

We want a simple standard format to describe the productions of a grammar so that we can do proofs and constructions more readily on the grammar. We must now show that every context-free grammar can be converted into this normal form.

In Chomsky Normal Form (CNF) every production is of one of the following three types:

  • A → BC (A, B, C ∈ V)
  • A → a (A ∈ V and a ∈ Σ)
  • S → ε (where S ∈ V is the start symbol)

In addition, S does not occur on the right-hand side of any production: (T, α) ∈ P ⇒ n_S(α) = 0, where n_S(α) counts the occurrences of S in α.
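The three production shapes and the restriction on S can be checked mechanically. The following is a minimal sketch, not part of the original notes: it assumes a production is encoded as a `(head, body)` pair where `body` is a tuple of symbols and the empty tuple stands for ε. The function name `is_cnf` and its parameters are hypothetical.

```python
def is_cnf(productions, variables, terminals, start):
    """Return True if every production has one of the three CNF shapes.

    Assumed encoding: each production is (head, body), body is a tuple
    of symbols, and the empty tuple represents ε.
    """
    for head, body in productions:
        if head not in variables:
            return False
        if start in body:
            return False  # S must not occur on any right-hand side
        if len(body) == 0:
            if head != start:
                return False  # only S → ε is allowed
        elif len(body) == 1:
            if body[0] not in terminals:
                return False  # must be A → a with a ∈ Σ
        elif len(body) == 2:
            if not all(sym in variables for sym in body):
                return False  # must be A → BC with B, C ∈ V
        else:
            return False  # right-hand side too long
    return True
```

For instance, the grammar S → ε | AB, A → a, B → b passes this check, while adding A → aB or A → ε makes it fail.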

To convert a grammar to CNF we must:

  • remove S from the right-hand side of productions
  • remove epsilon productions that are not from the start symbol (A → ε with A ≠ S)
  • remove productions that mix terminals and non-terminals on the right-hand side (A → abCD)
  • remove productions with more than one alphabet symbol on the right-hand side (A → aa)
  • remove productions with more than two variables on the right-hand side (A → ABC)
  • remove productions of length one that go to a variable (A → B)

That is, remove any productions that violate the rules.

To remove the epsilon productions we work out which variables can derive ε, and then, for each production that uses such a variable on its right-hand side, add a copy with that variable omitted. For example, if A ⇒* ε and we have a production B → ABC, then we add the production B → BC. We need to include S → ε if S ⇒* ε, as the base case, so that the empty string stays in the language.

For the unit productions (A → B) we just allow A to derive everything B can derive. So if we had A → B and B → CD | aa, we add A → CD and A → aa and remove A → B.

To stop alphabet symbols and variables from mixing in our rules, we promote each symbol to a variable and include a rule from that variable to the symbol as its only derivation. If we had A → aB, we can add a new rule X_a → a and replace A → aB with A → X_a B.
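The promotion of symbols to variables described above can be sketched in Python. This is an illustration under stated assumptions, not the notes' own implementation: productions are `(head, body)` pairs with `body` a tuple of symbols, the function name `promote_terminals` is hypothetical, and the generated names `X_a` are assumed not to clash with variables already in the grammar.

```python
def promote_terminals(productions, terminals):
    """Replace terminals in bodies of length >= 2 by fresh variables X_a."""
    result = []
    introduced = {}  # terminal -> name of its new variable (assumed fresh)
    for head, body in productions:
        if len(body) >= 2:
            new_body = []
            for sym in body:
                if sym in terminals:
                    introduced.setdefault(sym, "X_" + sym)
                    new_body.append(introduced[sym])
                else:
                    new_body.append(sym)
            body = tuple(new_body)
        result.append((head, body))
    # Each new variable derives its terminal and nothing else.
    for a, var in sorted(introduced.items()):
        result.append((var, (a,)))
    return result
```

Applied to the single production A → aB, this returns A → X_a B together with X_a → a. Productions of the form A → a are left alone, since they are already in CNF.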


Finally we need to get rid of productions that have more than two variables on the right-hand side. At this point these are the only non-CNF productions that remain in our grammar, and we deal with them by chaining. The production

  A → BCDE

is replaced with:

  A → B A_1
  A_1 → C A_2
  A_2 → DE

Removing epsilon productions

  1. Determine the set of nullable variables (the variables A that can derive ε, A ⇒* ε):
     (a) Let T be the set of all variables A such that A → ε is a production in our grammar. (base case)
     (b) For every production B → x_1 x_2 . . . x_r, if all x_i are nullable then B is nullable; add B to our set T of nullable variables. (recursive case)
     (c) Continue until no new variables can be added to T.
  2. Add new productions in which each occurrence of a nullable variable A is replaced by either A or ε, in all possible ways. For example, suppose that C and D are nullable, A and B are not, and we have a production H → CADB. Then we add the productions H → ADB, H → CAB and H → AB to accommodate the possibility of those variables going to the empty string.
  3. Remove all productions of the form A → ε.
  4. If S is in our nullable set T, then add the production S → ε.
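The four steps above can be sketched in Python. This is a minimal sketch under assumptions that are not in the notes: productions are `(head, body)` pairs with `body` a tuple of symbols and the empty tuple standing for ε, and the function names are hypothetical. The nullable set is updated in place during each pass, so it may reach the fixed point in fewer rounds than a round-by-round hand computation, but the final set is the same.

```python
from itertools import product

def nullable_variables(productions):
    """Step 1: grow the nullable set T until nothing new can be added."""
    nullable = {head for head, body in productions if len(body) == 0}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            if head not in nullable and body and all(s in nullable for s in body):
                nullable.add(head)
                changed = True
    return nullable

def remove_epsilon(productions, start):
    """Steps 2-4: expand nullable occurrences, drop A → ε, restore S → ε."""
    nullable = nullable_variables(productions)
    result = set()
    for head, body in productions:
        # Each nullable symbol is either kept or dropped; others are always kept.
        choices = [((s,), ()) if s in nullable else ((s,),) for s in body]
        for pick in product(*choices):
            new_body = tuple(sym for part in pick for sym in part)
            if new_body:  # step 3: no ε production is emitted here
                result.add((head, new_body))
    if start in nullable:
        result.add((start, ()))  # step 4
    return result
```

With C → ε and D → ε in the grammar, the production H → CADB yields H → CADB, H → ADB, H → CAB and H → AB, matching the example in step 2.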

Question: Why is it insufficient to write “remove all productions A → ε except for S”?

An example:

  C → DE
  D → FG
  E → ε
  F → ε
  G → ε

  Round 1: T = {E, F, G}
  Round 2: T = {D, E, F, G}
  Round 3: T = {C, D, E, F, G}


Remove Unit Productions

A unit production is a production A → B whose right-hand side is a single variable; more generally, we want to deal with every unit derivation A ⇒* B.

  1. Remove all ε productions.
  2. For each pair of variables (A, B) such that A ⇒* B and B → α is a production (α ∈ (V ∪ Σ)*), add the production A → α to our grammar.
  3. Remove all unit productions.
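Steps 2 and 3 above can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the notes: productions are `(head, body)` pairs with `body` a tuple of symbols, ε productions have already been removed (step 1), and the function names are hypothetical. For each variable A the first function collects the variables reachable from A through unit productions alone.

```python
def unit_closure(productions, variables):
    """For each variable A, collect every variable B with A =>* B
    using unit productions only (A itself is always included)."""
    closure = {v: {v} for v in variables}
    changed = True
    while changed:
        changed = False
        for head, body in productions:
            if len(body) == 1 and body[0] in variables:  # unit production
                for reach in closure.values():
                    if head in reach and body[0] not in reach:
                        reach.add(body[0])
                        changed = True
    return closure

def remove_unit_productions(productions, variables):
    """Give A every non-unit body of each B it reaches; drop unit productions."""
    closure = unit_closure(productions, variables)
    result = set()
    for a in variables:
        for head, body in productions:
            is_unit = len(body) == 1 and body[0] in variables
            if head in closure[a] and not is_unit:
                result.add((a, body))
    return result
```

For the grammar A → B, B → CD | aa this produces A → CD, A → aa, B → CD and B → aa, with the unit production A → B gone.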

If B can derive some sentential forms and A can derive B in some number of steps, we allow A to derive directly everything that B can derive directly, instead of requiring that A first turn into B through a chain of unit productions. After we remove all rules of the form A → B, we no longer have any rules in our grammar of the form A → α where |α| = 1 and α ∈ V*.

An algorithm to find the pairs (A, B) such that A ⇒* B is the following. For each variable A create a set T_A := {A}. Then, for all productions of the form X → Y where X ∈ T_A, add Y to T_A.

Example of this algorithm:

  A → BF | D | a
  B → CA | b | E
  C → c
  D → B | AC | d
  E → CC
  F → E | f

Initially, we create the following sets:

  T_A = {A}   T_B = {B}   T_C = {C}   T_D = {D}   T_E = {E}   T_F = {F}

After running through the set of rules in our grammar once, we get the following for our sets:

  T_A = {A, D}   T_B = {B, E}   T_C = {C}   T_D = {D, B}   T_E = {E}   T_F = {F, E}

We repeat this recursive step until we are not adding anything new to our sets:

  T_A = {A, D, B}   T_B = {B, E}   T_C = {C}   T_D = {D, B, E}   T_E = {E}   T_F = {F, E}

We run it again:

  T_A = {A, D, B, E}   T_B = {B, E}   T_C = {C}   T_D = {D, B, E}   T_E = {E}   T_F = {F, E}

We then run it again, notice that we don't change any of the sets, and terminate. Now we have a collection of sets T_X such that for all Y ∈ T_X it is the case that X ⇒* Y. Note that this corresponds to a chain of unit productions, since both X and Y are variables. Then we can continue with the algorithm to explicitly add rules for X corresponding to what Y can derive.

Promoting Symbols to Variables

At this point we have only productions in our grammar of the form:

  • S → ε
  • A → a
  • A → α where |α| ≥ 2 ∧ α ∈ (V ∪ Σ)*

We want to ensure that instead every such α with |α| ≥ 2 is a string in V* and not merely in (V ∪ Σ)*.

We make a new variable for each alphabet symbol that appears on such a right-hand side, and a production for the new variable to derive the symbol. If a ∈ Σ and we see a production A → α with α = βaγ, then we add a new variable X_a to our grammar with X_a → a as its only production, and we rewrite the production A → βaγ as A → βX_aγ. After we do this for all terminal symbols on right-hand sides of length greater than or equal to 2, all such right-hand sides are strings over variables alone, not over alphabet symbols.

Chain long forms into pairs

Given a production A → BCDE, we replace it with a chain of new variables and productions, each with length two on the right-hand side, and each continuing along the chain:

  A → B X_1
  X_1 → C X_2
  X_2 → DE

Seen as parse trees, the difference is:

        A
     / | | \
    B  C  D  E

is the original rule and

    A
   / \
  B   X_1
     /   \
    C     X_2
         /   \
        D     E

is the chained variant.

After doing this step, every production whose right-hand side consists of variables has length exactly 2. As a result, all productions now comply with the format of CNF, and at no step in our process did we change the language that the grammar produces. Therefore, we have taken an arbitrary context-free grammar and converted it into CNF.
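The chaining step can also be sketched in Python. This is an illustration under assumptions that are not in the notes: productions are `(head, body)` pairs with `body` a tuple of symbols, the function name `chain_long_productions` is hypothetical, and the generated names `X_1`, `X_2`, . . . are assumed not to clash with any variable already in the grammar.

```python
def chain_long_productions(productions):
    """Split every body of length > 2 into a chain of length-2 bodies."""
    result = []
    counter = 0
    for head, body in productions:
        while len(body) > 2:
            counter += 1
            fresh = f"X_{counter}"  # hypothetical fresh-name scheme
            # Keep the first symbol and push the rest down the chain.
            result.append((head, (body[0], fresh)))
            head, body = fresh, body[1:]
        result.append((head, body))
    return result
```

Applied to A → BCDE, this returns A → B X_1, X_1 → C X_2 and X_2 → DE. Bodies of length one or two pass through unchanged.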