Formal Languages CS 100: Introduction to the Profession Matthew - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Formal Languages

CS 100: Introduction to the Profession Matthew Bauer & Michael Saelee

slide-2
SLIDE 2

Some languages

  • “Natural” languages: English, Chinese, Thai
  • Programming languages: Java, Lisp, Lambda calculus
  • Domain specific languages: SQL, HTML/CSS, UML
  • Axiomatic systems: Propositional calculus, Set theory
slide-3
SLIDE 3

Languages: what for?

  • Socializing
  • Artistic expression
  • Communicating thoughts
  • Representing problems
  • Formalizing ideas
slide-4
SLIDE 4

Who cares?

  • Linguists: how to describe/categorize natural languages?
  • Philosophers: what kinds of (valid) thoughts can we express?
  • Mathematicians: how can we manipulate axiomatic systems?
  • Computer scientists: how do we use languages to reason about, specify, and perform computational tasks?

slide-5
SLIDE 5

Formally ...

  • A language consists of all well-formed, finite-length strings of symbols drawn from some alphabet.

  • “well-formed” according to some rules/constraints
  • strings ≈ words, sentences, formulae
  • symbols ≈ letters, tokens, terminals
slide-6
SLIDE 6

e.g. language over { I, love }* (the * is the “Kleene star”)

  • Constraint: sentences begin with “I” and can’t be empty
  • Valid sentences (infinite in number!):
  • I
  • I I I love
  • I love I love I love love love
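
The constraint above can be checked mechanically. The following is a minimal sketch, assuming sentences are written as whitespace-separated words; the function name `in_language` is made up for illustration and is not part of the slides.

```python
ALPHABET = {"I", "love"}

def in_language(sentence: str) -> bool:
    """Return True if `sentence` belongs to the example language over { I, love }."""
    words = sentence.split()
    # Every symbol must be drawn from the alphabet.
    if any(word not in ALPHABET for word in words):
        return False
    # The constraint: the sentence is non-empty and begins with "I".
    return len(words) > 0 and words[0] == "I"

for s in ["I", "I I I love", "I love I love I love love love", "", "love I"]:
    print(repr(s), in_language(s))
```

The three valid sentences from the slide are accepted, while the empty string and "love I" are rejected.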

slide-7
SLIDE 7

Syntax vs. Semantics

  • A formal language is strictly a syntactic specification
  • i.e., no ascription of semantics/meaning
  • “Colorless green ideas sleep furiously” (Chomsky, 1957) is a well-formed but nonsensical English sentence
  • Most applications of formal languages also require semantic interpretation to be useful (but not all!)

slide-8
SLIDE 8

Applications in CS

  • Data validation and recognition
  • Parsing / Syntax-checking; e.g., vis-a-vis compiling
  • Programming language specification
  • Complexity theory; e.g., how much computational power is needed to recognize all strings of a given language?

slide-9
SLIDE 9

Working with languages

  • Formal grammars generate languages
  • Automata accept strings of a language
  • Regular expressions match strings of a language
  • Parsers analyze/deconstruct strings of a language
slide-10
SLIDE 10

Formal Grammars

A formal grammar consists of:

  • 1. a set of terminal symbols Σ; i.e., the alphabet
  • 2. a set of non-terminal symbols N; aka variables
  • 3. a set of productions P of the form symbol(s) → symbol(s)
  • left hand side must contain at least one non-terminal
  • 4. a start symbol S
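
One way to make the four components concrete is to store them in a small data structure. This is only a sketch: the class name `Grammar`, its field names, and the toy grammar in the usage lines are assumptions for illustration, not something defined in the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # Σ, the alphabet
    nonterminals: frozenset   # N, the variables
    productions: tuple        # P, as (lhs, rhs) pairs; each side is a tuple of symbols
    start: str                # S, the start symbol

    def is_well_formed(self) -> bool:
        """Check the structural rules listed above."""
        # Terminals and non-terminals must be disjoint; S must be a non-terminal.
        if self.terminals & self.nonterminals or self.start not in self.nonterminals:
            return False
        symbols = self.terminals | self.nonterminals
        for lhs, rhs in self.productions:
            # Every symbol used must be declared.
            if not all(sym in symbols for sym in lhs + rhs):
                return False
            # The left hand side must contain at least one non-terminal.
            if not any(sym in self.nonterminals for sym in lhs):
                return False
        return True

# Hypothetical toy grammar: S → a S b | ε (an empty tuple stands for ε).
toy = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    productions=((("S",), ("a", "S", "b")), (("S",), ())),
    start="S",
)
print(toy.is_well_formed())
```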
slide-11
SLIDE 11

Chomsky Hierarchy

  • Grammars are categorized by the Chomsky Hierarchy
  • Type 0: no extra constraints
  • Type 1, aka “Context-Sensitive”: # symbols on left hand side of each production must be ≤ # symbols on right hand side
  • Type 2, aka “Context-Free”: left hand side of each production can only have one symbol (a non-terminal)
  • Type 3, aka “Regular”: each production can only be of the form A → a or A → aB, where A and B are non-terminals, and a is a terminal
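
The constraints above can be applied literally to a list of productions. The sketch below reports the highest type number whose rule every production satisfies; it follows the slide's wording only (so ε-productions are judged purely by these rules), and the function name and the sample grammars in the usage lines are assumptions for illustration.

```python
def chomsky_type(productions, nonterminals):
    """Return 3, 2, 1, or 0: the highest type whose rule all productions satisfy.

    `productions` is a list of (lhs, rhs) pairs of symbol tuples; any symbol
    not in `nonterminals` is treated as a terminal.
    """
    def is_regular(lhs, rhs):
        # A → a  or  A → aB
        if len(lhs) != 1 or lhs[0] not in nonterminals:
            return False
        if len(rhs) == 1:
            return rhs[0] not in nonterminals
        return len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals

    def is_context_free(lhs, rhs):
        # Exactly one symbol on the left, and it is a non-terminal.
        return len(lhs) == 1 and lhs[0] in nonterminals

    def is_context_sensitive(lhs, rhs):
        # Left hand side is no longer than the right hand side.
        return len(lhs) <= len(rhs)

    rules = ((3, is_regular), (2, is_context_free), (1, is_context_sensitive))
    for type_number, rule in rules:
        if all(rule(lhs, rhs) for lhs, rhs in productions):
            return type_number
    return 0

# Illustrative grammars (not from the slides).
regular = [(("A",), ("a", "B")), (("B",), ("b",))]
parens = [(("S",), ("S", "S")), (("S",), ("(", "S", ")")), (("S",), ())]
print(chomsky_type(regular, {"A", "B"}), chomsky_type(parens, {"S"}))
```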

slide-12
SLIDE 12

Chomsky Hierarchy

[Figure: nested sets, from outermost to innermost: All languages ⊃ Type 0 languages ⊃ Type 1: Context-sensitive languages ⊃ Type 2: Context-free languages ⊃ Type 3: Regular languages]

slide-13
SLIDE 13

Grammars & Languages

  • The language generated by a given grammar is the set of all strings we can derive from the start symbol

  • Recall: grammars are just one way of specifying languages
  • Not all languages can be described by grammars!
slide-14
SLIDE 14

e.g. CFG (Matched parentheses)

  • Σ = { (, ) }; N = { S }; start symbol S
  • Productions:
  • S → SS
  • S → ( S )
  • S → ε (ε is the empty string)

slide-15
SLIDE 15

e.g. CFG (Matched parentheses)

  • Σ = { (, ) }; N = { S }; start symbol S
  • Productions (using alternation):
  • S → SS | ( S ) | ε
  • e.g. deriving the string ( ( )( ) )
  • S ⇒ ( S ) ⇒ ( SS ) ⇒ ( ( S ) S ) ⇒ ( ( S )( S ) ) ⇒ ( ( )( S ) ) ⇒ ( ( )( ) )
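
A derivation like this can be replayed step by step in code. The sketch below always expands the leftmost non-terminal, using a caller-supplied sequence of right-hand sides; the function name `leftmost_derivation` and the encoding of productions as strings (with "" standing for ε) are assumptions for illustration.

```python
PRODUCTIONS = {"S": {"SS", "(S)", ""}}  # S → SS | ( S ) | ε, with "" for ε

def leftmost_derivation(start, choices):
    """Apply each chosen right-hand side to the leftmost non-terminal.

    Returns the list of sentential forms, starting with the start symbol.
    """
    form = [start]
    steps = ["".join(form)]
    for rhs in choices:
        # Locate the leftmost non-terminal in the current sentential form.
        index = next(i for i, sym in enumerate(form) if sym in PRODUCTIONS)
        if rhs not in PRODUCTIONS[form[index]]:
            raise ValueError(f"{form[index]} → {rhs!r} is not a production")
        form[index:index + 1] = list(rhs)
        steps.append("".join(form) or "ε")
    return steps

steps = leftmost_derivation("S", ["(S)", "SS", "(S)", "", "(S)", ""])
print(" ⇒ ".join(steps))
```

With these six choices the final sentential form is the string (()()) from the slide.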
slide-16
SLIDE 16

Derivation strategies

  • If we have a string of multiple non-terminals during the derivation process, we have to decide which to expand first

  • Two common strategies:
  • Leftmost derivation: expand the leftmost non-terminal
  • Rightmost derivation: expand the rightmost non-terminal
slide-17
SLIDE 17

S → SS | ( S ) | ε

  • Using leftmost derivation, derive:
  • ()()()
  • (())()(())
slide-18
SLIDE 18

e.g. CFG (Simple arithmetic)

Expr → Expr + Expr | Expr × Expr | 0 | 1 | 2 | … | 9

  • Derivation for 5 + 2 × 3?
slide-19
SLIDE 19

Parse trees

  • Describe how a string is derived from some non-terminal
  • The root node represents the start symbol
  • Internal nodes represent non-terminals
  • Leaf nodes represent terminals
slide-20
SLIDE 20

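[Figure: two parse trees for 5 + 2 × 3. In one, the root expands to Expr + Expr, with 5 on the left and Expr × Expr (2, 3) on the right; in the other, the root expands to Expr × Expr, with Expr + Expr (5, 2) on the left and 3 on the right.]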

Expr → Expr + Expr | Expr × Expr | 0 | 1 | 2 | … | 9

  • Parse tree for 5 + 2 × 3?
  • This grammar is ambiguous; i.e., it may produce multiple parse trees for a given string
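
The ambiguity can be observed by enumerating every way the grammar can split the token sequence. This is a brute-force sketch, not an efficient parser; trees are written as nested tuples with the Expr nodes left implicit, and the function names are made up for illustration.

```python
DIGITS = set("0123456789")

def parse_trees(tokens):
    """Return every parse tree for `tokens` under
    Expr → Expr + Expr | Expr × Expr | 0 | 1 | ... | 9."""
    if len(tokens) == 1 and tokens[0] in DIGITS:
        return [tokens[0]]
    trees = []
    # Try each operator as the one introduced by the topmost production.
    for i, tok in enumerate(tokens):
        if tok in ("+", "×"):
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, tok, right))
    return trees

def evaluate(tree):
    """Evaluate a tree by giving + and × their usual arithmetic meaning."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)

for tree in parse_trees("5 + 2 × 3".split()):
    print(tree, "=", evaluate(tree))
```

The string 5 + 2 × 3 yields two trees, which evaluate to 11 and 21 respectively, so the choice of tree matters as soon as meaning is attached to it.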

slide-21
SLIDE 21

Ambiguous grammars

  • May be problematic, especially if semantics are ascribed to substructures of the parse tree

  • E.g., arithmetic precedence, control structure nesting
slide-22
SLIDE 22

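[Figure: the same two parse trees for 5 + 2 × 3: one with + at the root (5 on the left, 2 × 3 on the right), the other with × at the root (5 + 2 on the left, 3 on the right).]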

Expr → Expr + Expr | Expr × Expr | 0 | 1 | 2 | … | 9

  • Parse tree for 5 + 2 × 3?

this is the desired parse tree! (why?)

slide-23
SLIDE 23

“Fixing” ambiguous grammars

  • Rewrite grammar so it is no longer ambiguous but generates the same language (can be hard/impossible!)
  • May result in different parse trees
  • Add disambiguating productions to force the desired parse trees to be generated

slide-24
SLIDE 24

e.g. CFG (Simple arithmetic)

Expr → Term | Expr + Term
Term → Factor | Term × Factor
Factor → 0 | 1 | 2 | … | 9

slide-25
SLIDE 25
  • Parse tree for 5 + 2 × 3?

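[Figure: parse tree with root Expr → Expr + Term. The left Expr derives 5 via Term → Factor; the right Term expands to Term × Factor, where Term derives 2 via Factor and Factor derives 3.]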

slide-26
SLIDE 26

e.g. CFG (Simple arithmetic)

We can update our grammar to allow for parentheses:

Expr → Term | Expr + Term
Term → Factor | Term × Factor
Factor → 0 | 1 | 2 | … | 9 | ( Expr )

slide-27
SLIDE 27

Expr → Term | Expr + Term
Term → Factor | Term × Factor
Factor → 0 | 1 | 2 | … | 9 | ( Expr )

  • Using leftmost derivation, show the parse trees for:
  • 1 + 2 + 3
  • 1 + 2 × 3 + 4
  • (1 + 2) × (3 + 4)
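
To check answers to the exercise above, the grammar can be turned into a small recursive-descent evaluator, sketched below. Two assumptions are worth flagging: the left-recursive productions (Expr → Expr + Term, Term → Term × Factor) are handled with loops rather than recursion, which accepts the same strings and keeps left-to-right grouping, and the code computes values instead of drawing trees. All function names are made up for illustration.

```python
DIGITS = set("0123456789")

def parse_expr(tokens, pos):
    """Expr → Term | Expr + Term, handled as Term followed by any number of '+ Term'."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        value += right
    return value, pos

def parse_term(tokens, pos):
    """Term → Factor | Term × Factor, handled as Factor followed by any number of '× Factor'."""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "×":
        right, pos = parse_factor(tokens, pos + 1)
        value *= right
    return value, pos

def parse_factor(tokens, pos):
    """Factor → 0 | 1 | ... | 9 | ( Expr )"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in DIGITS:
        return int(tok), pos + 1
    if tok == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    """Parse and evaluate `text`; each non-space character is one token."""
    tokens = [ch for ch in text if not ch.isspace()]
    value, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return value

for text in ["1 + 2 + 3", "1 + 2 × 3 + 4", "(1 + 2) × (3 + 4)"]:
    print(text, "=", evaluate(text))
```

Because × is only introduced below + in this grammar, multiplication groups more tightly than addition without any extra precedence rule.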
slide-28
SLIDE 28

e.g. CFG (Java)

  • http://cs.au.dk/~amoeller/RegAut/JavaBNF.html
slide-29
SLIDE 29

Regular Grammars

  • Recall, productions must take the form A → a or A → aB, where A and B are non-terminals, and a is a terminal
  • Technically, this describes a right-regular grammar; left-regular grammars also exist (what would they look like?)

slide-30
SLIDE 30

e.g. Regular Grammar

  • A → 0A | 1B | ε
  • B → 0B | 1A
  • Derive some strings based on this grammar. What characteristic do they share?

  • All strings have an even number of 1s; aka even parity
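
The claim can be spot-checked by enumerating short strings that the grammar derives. The sketch below encodes the two productions as strings (with "" standing for ε) and collects every derivable string up to a length bound; the function name `generate` is an assumption for illustration.

```python
PRODUCTIONS = {
    "A": ["0A", "1B", ""],  # A → 0A | 1B | ε
    "B": ["0B", "1A"],      # B → 0B | 1A
}

def generate(symbol, max_len):
    """Return all terminal strings of length ≤ max_len derivable from `symbol`."""
    results = set()
    for rhs in PRODUCTIONS[symbol]:
        if rhs == "":
            results.add("")
        elif max_len > 0:
            # Each remaining production emits one terminal, then continues
            # from the following non-terminal.
            terminal, next_symbol = rhs[0], rhs[1]
            for tail in generate(next_symbol, max_len - 1):
                results.add(terminal + tail)
    return results

strings = sorted(generate("A", 3), key=lambda s: (len(s), s))
print(strings)
print(all(s.count("1") % 2 == 0 for s in strings))
```

Starting from A with a bound of 3, this produces the empty string, 0, 00, 11, 000, 011, 101, and 110.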
slide-31
SLIDE 31

Limitation & Simplicity

  • Because regular grammars only expand to the right (or left), they cannot generate languages with nested/recursive substructures (e.g., matching parentheses)
  • Due to this simplicity, recognizing regular languages requires limited computing power and memory

  • Finite-state machines can be used to recognize regular languages!
slide-32
SLIDE 32

e.g. FSM acceptor (even parity)

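[Figure: FSM with start state S0 and a second state S1; transitions labeled 1 lead from S0 to S1 and from S1 back to S0.]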

  • Candidate strings are scanned left to right; each token follows the appropriate state transition (start from state S0)
  • FSM fails to accept a string if a valid state transition is not available or it fails to terminate on a final (circled) state
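
The acceptance procedure described above can be sketched as a table-driven loop. The transition table here is partly an assumption: the 1-transitions between S0 and S1 follow the slide, while the 0 self-loops and the choice of S0 as the only final state are inferred from the even-parity grammar (A → 0A | 1B | ε, B → 0B | 1A) rather than read off the diagram.

```python
TRANSITIONS = {
    ("S0", "0"): "S0",  # assumed self-loop, mirroring A → 0A
    ("S0", "1"): "S1",
    ("S1", "0"): "S1",  # assumed self-loop, mirroring B → 0B
    ("S1", "1"): "S0",
}
FINAL_STATES = {"S0"}   # assumed: an even number of 1s ends in S0

def accepts(string, start="S0"):
    """Scan `string` left to right, following one transition per token."""
    state = start
    for token in string:
        # No valid transition available: the string is rejected.
        if (state, token) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, token)]
    # The string is accepted only if scanning ends on a final state.
    return state in FINAL_STATES

for s in ["", "0", "11", "1001", "1", "10", "12"]:
    print(repr(s), accepts(s))
```

Note that the machine keeps only its current state, regardless of how long the input is.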

slide-33
SLIDE 33

Ubiquity of Regular languages

  • Despite (due to?) their relative simplicity, regular languages are incredibly important and commonplace

  • Vast majority of simple data formats are regular languages
  • e.g., URLs, e-mail addresses, dates, numerical data, etc.
  • Even when not, useful subsets of data often are
slide-34
SLIDE 34

Regular Expressions

  • Regular expressions are another way of describing how to match strings corresponding to regular languages
  • Can also be used to extract data from and manipulate strings being matched

slide-35
SLIDE 35

Some Regexp Elements

  • Most characters match themselves (aka literals)
  • Metacharacters may match a set of characters (e.g., ‘.’ matches any character, ‘\d’ matches a digit)
  • Quantifiers indicate how many of the preceding element (a character or a group) to match (e.g., ‘*’ = 0 or more, ‘+’ = 1 or more, ‘?’ = 0 or 1)

  • | for alternation, () for grouping, [] for character classes
slide-36
SLIDE 36

e.g. Regexps

  • mic.* matches mic, michael, mic_9c, …
  • m+ike matches mike, mmike, mmmike, …
  • r(at)+ matches rat, ratatatatat
  • (m|n)+emonic matches mnemonic, mnmnnmnemonic, ...
  • CS.?\d{3} matches CS_100, CS200, CS 351, …
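
These examples can be tried directly with Python's standard `re` module. The sketch below uses `re.fullmatch`, so a pattern must match the entire string; the dictionary name and the helper `all_match` are made up for illustration, and other regex flavors may differ in details.

```python
import re

# Patterns and sample strings taken from the list above.
EXAMPLES = {
    r"mic.*": ["mic", "michael", "mic_9c"],
    r"m+ike": ["mike", "mmike", "mmmike"],
    r"r(at)+": ["rat", "ratatatatat"],
    r"(m|n)+emonic": ["mnemonic", "mnmnnmnemonic"],
    r"CS.?\d{3}": ["CS_100", "CS200", "CS 351"],
}

def all_match(pattern, strings):
    """Return True if `pattern` matches every string in full."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)

for pattern, strings in EXAMPLES.items():
    print(pattern, all_match(pattern, strings))

# A few strings that should not match.
print(re.fullmatch(r"m+ike", "ike"))        # '+' needs at least one m
print(re.fullmatch(r"r(at)+", "r"))         # '+' needs at least one "at"
print(re.fullmatch(r"CS.?\d{3}", "CS12"))   # three digits are required
```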
slide-37
SLIDE 37

Regexp = FSM = Reg. Grammar

  • All can be used interchangeably to specify a regular language!
  • Regexps are just algebraic notation for regular grammars
  • FSMs can be designed to accept precisely the language generated by a regular grammar

slide-38
SLIDE 38

e.g. Even parity Regexp?

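[Figure: FSM with start state S0 and a second state S1; transitions labeled 1 lead from S0 to S1 and from S1 back to S0.]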
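
One candidate answer, offered as a sketch rather than the slides' own solution, is 0*(10*10*)*: any number of 0s, followed by groups that each contain exactly two 1s. The code below compares it against a direct count of 1s for every binary string up to a small length; the names `EVEN_PARITY` and `has_even_parity` are assumptions for illustration.

```python
import re
from itertools import product

# Candidate pattern: leading 0s, then zero or more groups with exactly two 1s.
EVEN_PARITY = re.compile(r"0*(10*10*)*")

def has_even_parity(string):
    """Reference check: count the 1s directly."""
    return string.count("1") % 2 == 0

mismatches = []
for length in range(9):
    for bits in product("01", repeat=length):
        s = "".join(bits)
        if (EVEN_PARITY.fullmatch(s) is not None) != has_even_parity(s):
            mismatches.append(s)

print("mismatches:", mismatches)
```

An exhaustive check over short strings is evidence, not a proof, that the pattern and the parity test agree.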

slide-39
SLIDE 39

Demo

  • https://regexr.com