SLIDE 1

Compiler Construction

Mayer Goldberg \ Ben-Gurion University October 31, 2018

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 1 / 175

SLIDE 2

Chapter 2

Goals

▶ The pipeline of the compiler
▶ Introduction to syntactic analysis
▶ Further steps in OCaml

Agenda

▶ The pipeline
  ▶ Syntactic analysis
  ▶ Semantic analysis
  ▶ Code generation
▶ The compiler for the course
▶ The language of S-expressions
▶ More OCaml

SLIDE 3

Refresher

Last week, we discussed

▶ The interpreter as an evaluation function
▶ The compiler as a translator & optimizer
▶ The relations between interpretation & compilation

This was a rather high-level view of the area. We now wish to consider compilation as a large software project.

SLIDE 4

Compilation as translation

A compiler translates between languages:

▶ Understanding the syntax of the program
  ▶ What kinds of statements & expressions there are
  ▶ What are the various parts of these statements & expressions
  ▶ Are they syntactically correct
▶ Understanding the meaning of the program
  ▶ Do the operations make sense?
  ▶ What are their types?
  ▶ Are they used in accordance with their types?
  ▶ On what data is the program acting?
  ▶ What is returned?
▶ Once we understand the syntax and meaning of a sentence, we can render it in another language

SLIDE 5

The pipeline of the compiler

Since the 1950s, the standard architecture for compilers has been a pipeline:

▶ Syntactic analysis
  ▶ Scanning
  ▶ Parsing
    ▶ Reading
    ▶ Tag-Parsing
▶ Semantic analysis
▶ Code generation

[Figure: the compiler pipeline. chars → Scanner → tokens → Reader → sexprs → Tag-Parser → ASTs → Semantic Analyser → ASTs → Code Generator → asm / mach lang. The Reader and the Tag-Parser together form the Parser.]

SLIDE 6

The pipeline of the compiler

The stages in the compiler pipeline are distinguished by

▶ Function: What they do
▶ Dependencies: Which stages depend on which others
▶ Complexity: How difficult it is to perform a stage

In programming languages:

▶ Understanding syntax is relatively straightforward (unlike in natural language)
▶ Understanding meaning is way harder than understanding syntax
▶ Meaning is built upon syntax (in natural languages, syntax & meaning can be inter-dependent)
▶ Code generation is relatively straightforward (template-based)

SLIDE 7

The pipeline of the compiler

Optimizations

How optimizations fit into the pipeline of the compiler:

▶ We distinguish [at least] two levels of optimizations:
  ▶ High-level optimizations (closer to the source language) would go into the semantic analysis phase
  ▶ Low-level optimizations (closer to assembly language) would go into the code generation phase

This distinction can be fuzzy. Some make it fuzzier with intermediate-level optimizations.

SLIDE 8

An example of a high-level optimization

Suppose the compiler can know that the value of n is 0 when reaching the following statement:

    if (n == 0) { foo(); }
    else { goo(n); }

Then an obvious optimization to perform would be to replace the if-statement with:

    foo();
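As a rough illustration of this kind of high-level optimization, the following Python sketch folds an if-node whose test can be decided at compile time. The tuple-based AST encoding and the names `fold_if` and `known` are assumptions made for this example; they are not taken from the course compiler.

```python
# Hypothetical AST encoding used only in this sketch:
#   ("if", test, then_branch, else_branch)
#   ("eq", variable_name, constant)
#   ("call", function_name, *arguments)

def fold_if(node, known):
    """Replace an if-node by one of its branches when the test is decidable.

    `known` maps variable names to values the compiler has already proven.
    """
    if node[0] != "if":
        return node
    _, test, then_branch, else_branch = node
    if test[0] == "eq" and test[1] in known:
        return then_branch if known[test[1]] == test[2] else else_branch
    return node  # the test cannot be decided at compile time: keep the if


stmt = ("if", ("eq", "n", 0), ("call", "foo"), ("call", "goo", "n"))
print(fold_if(stmt, {"n": 0}))  # the compiler knows that n is 0
print(fold_if(stmt, {}))        # nothing is known: the statement is unchanged
```

When nothing is known about `n`, the node is returned untouched, which is the conservative choice an optimizer has to make.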

SLIDE 9

How has the code improved:

Before

if (n == 0) { foo(); } else { goo(n); }

After

foo();

What was gained

▶ The test during run-time has been eliminated
▶ The code is shorter
▶ This may lead to further, cascading optimizations

SLIDE 10

An example of a low-level optimization

Before:

    mov rax, 1
    mov rax, 2

After:

    mov rax, 2
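A low-level optimization of this kind is often implemented as a peephole pass over the instruction stream. The sketch below is one possible version, under the assumption that every instruction is an `(opcode, destination, source)` triple; the function name `drop_dead_movs` is hypothetical.

```python
def drop_dead_movs(instructions):
    """Remove a `mov` whose destination is overwritten by the very next `mov`.

    Every instruction is an (opcode, destination, source) triple of strings.
    The earlier mov is kept whenever the later one reads the destination.
    """
    result = []
    for instr in instructions:
        if result:
            prev_op, prev_dst, _ = result[-1]
            op, dst, src = instr
            if prev_op == op == "mov" and prev_dst == dst and dst not in src:
                result.pop()  # the previous value is never observed
        result.append(instr)
    return result


before = [("mov", "rax", "1"), ("mov", "rax", "2")]
print(drop_dead_movs(before))
```

The `dst not in src` test is deliberately conservative: a source operand such as `[rax+8]` mentions the destination register, so the earlier `mov` is left in place.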

SLIDE 11

How has the code improved:

Before

    mov rax, 1
    mov rax, 2

After

    mov rax, 2

What was gained

▶ Saved 1 cycle
▶ Made the code smaller
▶ If this code appears within a loop, gains shall be multiplied…

SLIDE 12

The pipeline of the compiler

Basic concepts

▶ Concrete syntax ▶ Abstract syntax ▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace

SLIDE 13

Concrete syntax (continued)

The concrete syntax of a computer program is a textual stream of characters:

▶ It’s one-dimensional
▶ Lacking in structure
  ▶ No nesting
  ▶ No sub-expressions
▶ Difficult to work with
  ▶ Difficult to access parts
  ▶ Difficult to determine correctness
▶ Contains redundancies (spaces, comments, etc.)

    (define fact (lambda (n) (if (zero? n) 1 (* n (fact (- n 1))))))

Think of

▶ A text file
▶ Characters typed at the prompt

SLIDE 14

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax

▶ Abstract syntax ▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace

SLIDE 15

Abstract syntax

The abstract syntax of a computer program is a tree-like data-structure. It is:

▶ Multi-dimensional
▶ Conveys structure
  ▶ Nested
  ▶ Recursive (following the inductive definition of the grammar)
▶ Easier to work with than the concrete syntax
  ▶ Easier to access parts
  ▶ Easier to verify correctness
  ▶ Some syntactic correctness issues have already been decided

SLIDE 16

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax 🗹 Abstract syntax

▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace

SLIDE 17

Abstract Syntax-Tree (AST)

Notice

▶ The AST is a tree
▶ No text
▶ No parentheses
▶ No spaces, tabs, newlines
▶ The structure is evident
▶ Easy to find sub-expressions
▶ Easy to determine correctness
▶ Easier to analyze, transform, and compile

[Figure: The AST of fact]
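One way to picture the AST of fact is as nested records. The sketch below is only an illustration: the node classes (`Const`, `Var`, `If`, `Applic`, `Lambda`, `Define`) and their field names are assumptions for this example, not the node types used in the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: object


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class If:
    test: object
    then: object
    otherwise: object


@dataclass(frozen=True)
class Applic:
    proc: object
    args: tuple


@dataclass(frozen=True)
class Lambda:
    params: tuple
    body: object


@dataclass(frozen=True)
class Define:
    name: str
    value: object


# (fact (- n 1))
recursive_call = Applic(Var("fact"), (Applic(Var("-"), (Var("n"), Const(1))),))

# (define fact (lambda (n) (if (zero? n) 1 (* n (fact (- n 1))))))
fact_ast = Define(
    "fact",
    Lambda(
        ("n",),
        If(
            Applic(Var("zero?"), (Var("n"),)),
            Const(1),
            Applic(Var("*"), (Var("n"), recursive_call)),
        ),
    ),
)

# Sub-expressions are reached by field access; no text has to be scanned.
print(fact_ast.value.body.test)
```

Note that the structure contains no parentheses, spaces, or newlines: only the nesting itself survives.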

SLIDE 18

Concrete vs Abstract Syntax

▶ Parsing: going from concrete syntax to abstract syntax
▶ Parser: the tool that performs parsing

Concrete Syntax

▶ Lacks structure
▶ Prone to errors
▶ Hard to delimit sub-expressions
▶ Inefficient to work with
▶ Concrete Syntax can be avoided
  ▶ Visual languages
  ▶ Structure/syntax editors

Abstract Syntax

▶ Has structure
▶ Many kinds of errors are avoided
▶ Sub-Expressions are readily accessible
▶ Efficient to work with

SLIDE 19

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax 🗹 Abstract syntax 🗹 Abstract Syntax-Tree (AST)

▶ Token ▶ Delimiter ▶ Whitespace

SLIDE 20

Tokens

▶ The smallest, meaningful, lexical unit in a language
▶ Described using regular expressions
▶ Identified using a DFA (deterministic finite automaton, a very simple model of computation)
▶ Examples
  ▶ Numbers
  ▶ [Non-nested] Strings
  ▶ Names (variables, functions)
  ▶ Punctuation
▶ Cannot handle nesting of any kind:
  ▶ Parenthesized expressions
  ▶ Nested comments
  ▶ etc.

SLIDE 21

Tokens

▶ Scanning: going from characters into tokens
▶ Scanner: the tool that performs scanning
▶ Scanner generator: the tool that takes definitions for tokens, using regular expressions (and callback functions), and returns a scanner
▶ Examples of scanner-generators: lex, flex
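To give a feel for what a scanner does, here is a small hand-written sketch built on Python's `re` module. The token classes and their regular expressions are simplified assumptions for this example; a real Scheme scanner recognizes many more kinds of tokens.

```python
import re

# Token classes as regular expressions (simplified, hypothetical definitions).
# A number must be followed by a delimiter, so "12abc" scans as one symbol.
TOKEN_SPEC = [
    ("NUMBER", r'-?\d+(?![^\s()"])'),
    ("STRING", r'"[^"]*"'),
    ("LPAREN", r'\('),
    ("RPAREN", r'\)'),
    ("SYMBOL", r'[^\s()"]+'),
    ("SKIP", r'\s+'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def scan(text):
    """Turn a string of characters into a list of (kind, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if match.lastgroup != "SKIP":  # whitespace only separates tokens
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(scan('(fact 5 "hi")'))
```

Because each token class is a regular expression, the scanner can report parentheses as tokens but cannot check that they are balanced; that is left to the parser.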

SLIDE 22

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax 🗹 Abstract syntax 🗹 Abstract Syntax-Tree (AST) 🗹 Token

▶ Delimiter ▶ Whitespace

SLIDE 23

Delimiters

▶ Delimiters are characters that separate tokens
▶ In most languages spaces, parentheses, commas, semicolons, etc., are all delimiters
▶ Some tokens must be separated by delimiters
  ▶ Two consecutive numbers, two consecutive symbols, etc.
▶ Some tokens do not need to be separated by delimiters
  ▶ Two consecutive strings, an open parenthesis followed by a number, etc.
▶ Delimiters are language-dependent

SLIDE 24

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax 🗹 Abstract syntax 🗹 Abstract Syntax-Tree (AST) 🗹 Token 🗹 Delimiter

▶ Whitespace

SLIDE 25

Whitespace

▶ Whitespace refers to characters that
  ▶ Have no graphical representation
  ▶ Occur between tokens
  ▶ Spaces within strings are not whitespaces…
  ▶ Serve no syntactic purpose other than as delimiters and for indentation
▶ Whitespace is language-dependent

SLIDE 26

Delimiters in various languages

C & Scheme

Spaces, tabs, newlines, carriage returns, and form feeds are examples of whitespaces

Java

Literal newline characters may not occur inside a literal string (must use \n). Otherwise, similar to C & Scheme.

Python

Leading indentation (tabs or spaces) is not whitespace in this sense, because it has a clear syntactic function: it denotes nesting level.

SLIDE 27

Concrete vs Abstract syntax

Artifacts of the Concrete Syntax

▶ Delimiters & whitespaces
▶ Parentheses, brackets, braces, and other grouping and nesting mechanisms (e.g., begin...end)

Re-examine the concrete and abstract syntax for the factorial function, and notice what’s gone!

SLIDE 28

Concrete vs Abstract syntax (continued)

The concrete syntax

    (define fact (lambda (n) (if (zero? n) 1 (* n (fact (- n 1))))))

The abstract syntax

[Figure: the AST of fact]

SLIDE 29

The pipeline of the compiler

Basic concepts 🗹 Concrete syntax 🗹 Abstract syntax 🗹 Abstract Syntax-Tree (AST) 🗹 Token 🗹 Delimiter 🗹 Whitespace

SLIDE 30

More on parsing

To parse computer programs in a given language, we rely on:

▶ Grammars with which to express the syntax of the language
  ▶ There are different kinds of grammars (CFG, CSG, two-level, etc.)
  ▶ There are different languages for expressing syntax in a grammar (e.g., BNF)
▶ Algorithms for parsing programs as per kind of grammar
▶ Techniques (e.g., parsing combinators, DCGs)

Parser generator: Takes a description of the grammar for a language, and generates a parser. For example, yacc, bison, nearley, etc.

SLIDE 31

The pipeline of the compiler (continued)

Scanning

▶ Going from characters to tokens
▶ Identifying & grouping characters into tokens for words, numbers, strings, etc.
▶ Parsing over tokens is more efficient than parsing over characters
  ☞ As the parser examines various ways to parse the code, the parser can avoid re-identifying and re-building complex tokens such as numbers, strings, etc.

SLIDE 32

The pipeline of the compiler (continued)

Reading

▶ In LISP/Prolog, the parser is split into two components:
  ▶ The reader, or the parser for the data language
  ▶ The tag-parser, or the parser for the source code
▶ In LISP/Scheme/Racket/Clojure/etc., the abstract syntax for the data is the concrete syntax for the code
▶ In Prolog, the abstract syntax for the data is the abstract syntax for the code
▶ Prolog is the programming language with the most powerful capabilities of reflection, i.e., code examining and working with itself.

SLIDE 33

The pipeline of the compiler (continued)

Reading — Summary

▶ In programming languages in which the syntax of code is not a part of the syntax of data, concrete syntax is given as a stream of characters
▶ In programming languages in which the syntax of code is part of the syntax of data, things are a bit more complex:
  ▶ The concrete syntax of data is a stream of characters
  ▶ The concrete syntax of code is the abstract syntax of the data
▶ In Scheme, the language of data is called S-expressions (more on this, later)
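To make the role of the reader more tangible, here is a minimal sketch of one in Python. It is an illustration under several assumptions: Python lists stand in for proper lists, Python strings for symbols, and strings, characters, and dotted pairs are left out entirely. The names `read` and `read_tokens` are hypothetical.

```python
import re


def read(text):
    """Read a single S-expression from text."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    sexpr, rest = read_tokens(tokens)
    if rest:
        raise SyntaxError("trailing input after the S-expression")
    return sexpr


def read_tokens(tokens):
    """Return (sexpr, remaining_tokens); nesting is handled by recursion."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head == ")":
        raise SyntaxError("unexpected )")
    if head != "(":
        # An atom: an integer if it looks like one, otherwise a symbol.
        return (int(head) if re.fullmatch(r"-?\d+", head) else head), rest
    items = []
    while rest and rest[0] != ")":
        item, rest = read_tokens(rest)
        items.append(item)
    if not rest:
        raise SyntaxError("missing )")
    return items, rest[1:]


print(read("(define fact (lambda (n) (if (zero? n) 1 (* n (fact (- n 1))))))"))
```

The result is plain nested data. Nothing in the reader knows what `define`, `lambda`, or `if` mean; assigning that meaning is the job of the tag-parser.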

SLIDE 34

The pipeline of the compiler (continued)

Tag-Parsing

▶ The tag-parser takes sexprs and returns [ASTs for] exprs
▶ Languages outside the LISP & Prolog families do not split parsing into a reader & tag-parser
  ▶ In such languages, parsing goes directly from tokens to [ASTs for] expressions
☞ Every valid program “used to be” [i.e., before tag-parsing] a valid sexpr
☞ Not every valid sexpr is a valid program!
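The last two points can be illustrated with a toy tag-parser. In this sketch, S-expressions are nested Python lists with strings for symbols, and the AST is made of tagged tuples; both encodings, and the name `tag_parse`, are assumptions for this example. Only the two-armed `if`, single-body `lambda`, and applications are handled.

```python
def tag_parse(sexpr):
    """Convert an S-expression (nested Python lists) into a tagged AST."""
    if isinstance(sexpr, (int, bool)):
        return ("const", sexpr)
    if isinstance(sexpr, str):
        return ("var", sexpr)
    if not sexpr:
        raise SyntaxError("() is not a valid expression")
    head = sexpr[0]
    if head == "if":
        if len(sexpr) != 4:
            raise SyntaxError("if expects a test and two branches")
        return ("if",) + tuple(tag_parse(e) for e in sexpr[1:])
    if head == "lambda":
        if len(sexpr) != 3 or not isinstance(sexpr[1], list):
            raise SyntaxError("malformed lambda")
        return ("lambda", tuple(sexpr[1]), tag_parse(sexpr[2]))
    # Anything else is an application of a procedure to arguments.
    return ("applic", tag_parse(head), tuple(tag_parse(e) for e in sexpr[1:]))


print(tag_parse(["if", ["zero?", "n"], 1, "n"]))
```

The list `["if", "x"]` is a perfectly good S-expression, yet `tag_parse` rejects it with a `SyntaxError`: a valid sexpr that is not a valid program in this sketch.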

SLIDE 35

The pipeline of the compiler (continued)

Question

A parser should:

👎 Perform optimizations
👎 Evaluate expressions
👎 Raise type-mismatch errors
👎 Find potential runtime errors (null-pointer dereferences, array-index errors, etc.)
👍 Validate the structure of input programs against a syntactic specification

SLIDE 36

The pipeline of the compiler (continued)

Question

Using an AST, it is impossible to:

👎 Perform code reformatting/beautification/style-checking
👎 Perform optimizations
👎 Output a new program which is semantically equivalent to the input program (code generation)
👎 Refactor the input program
👍 Generate a list of all the comments in the code

SLIDE 37

The pipeline of the compiler (continued)

Semantic Analysis

▶ Annotate the ASTs
▶ Compute addresses
▶ Annotate tail-calls
▶ Type-check code
▶ Perform optimizations
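As an example of what annotating an AST can look like, the sketch below marks applications that occur in tail position. The tagged-tuple AST and the `applic-tp` tag are assumptions made for this illustration; the course compiler may represent this differently.

```python
def annotate_tail_calls(node, in_tail_position=False):
    """Return a copy of the AST in which tail-position calls are re-tagged."""
    tag = node[0]
    if tag in ("const", "var"):
        return node
    if tag == "if":
        _, test, then, otherwise = node
        return ("if",
                annotate_tail_calls(test, False),  # the test is never a tail call
                annotate_tail_calls(then, in_tail_position),
                annotate_tail_calls(otherwise, in_tail_position))
    if tag == "lambda":
        _, params, body = node
        return ("lambda", params, annotate_tail_calls(body, True))
    if tag == "applic":
        _, proc, args = node
        new_tag = "applic-tp" if in_tail_position else "applic"
        return (new_tag,
                annotate_tail_calls(proc, False),
                tuple(annotate_tail_calls(arg, False) for arg in args))
    raise ValueError(f"unknown node: {tag}")


# (lambda (x) (f (g x))): the call to f is a tail call, the call to g is not.
ast = ("lambda", ("x",),
       ("applic", ("var", "f"), (("applic", ("var", "g"), (("var", "x"),)),)))
print(annotate_tail_calls(ast))
```

The pass goes from ASTs to ASTs: the tree keeps its shape and only gains information that the code generator can use later.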

SLIDE 38

The pipeline of the compiler (continued)

Code Generation

▶ Generate a stream of instructions in
  ▶ assembly language
  ▶ machine language
    ▶ Build executable
  ▶ some other target language…
▶ Perform low-level optimizations
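Template-based code generation can be sketched as follows: every kind of AST node has a fixed template of instructions, with holes filled by the code of its sub-expressions. This is a simplified illustration, not the course's generator. It assumes a tagged-tuple AST, leaves each result in `rax`, and treats 0 as false, which real Scheme does not.

```python
import itertools


def generate(node, labels=None):
    """Emit assembly-like text for a tagged AST; the result is left in rax."""
    if labels is None:
        labels = itertools.count()  # source of fresh label numbers
    tag = node[0]
    if tag == "const":
        return f"mov rax, {node[1]}\n"
    if tag == "if":
        _, test, then, otherwise = node
        n = next(labels)
        return (generate(test, labels)
                + "cmp rax, 0\n"
                + f"je Lelse{n}\n"
                + generate(then, labels)
                + f"jmp Lexit{n}\n"
                + f"Lelse{n}:\n"
                + generate(otherwise, labels)
                + f"Lexit{n}:\n")
    raise NotImplementedError(tag)


print(generate(("if", ("const", 1), ("const", 2), ("const", 3))))
```

Each `if` draws a fresh number for its labels, so nested conditionals do not clash.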

SLIDE 39

The compiler for the course

Our compiler project

▶ Written in OCaml
▶ Supports a subset of Scheme + extensions
▶ Supports two simple optimizations
▶ Compiles to x86-64
▶ Runs on Linux

What our project shall lack

▶ Support for the full language of Scheme
▶ Support for garbage collection
▶ Self-compilation

SLIDE 40

S-expressions

▶ We’re going to learn about syntax by studying the syntax of Scheme
  ▶ After all, we’re writing a Scheme compiler…
  ▶ It’s relatively simple, compared to the syntax of C, Java, Python, and many other languages
  ▶ It comes with some interesting twists
▶ Scheme comes with two languages:
  ▶ A language for code
  ▶ A language for data
  and there’s a tricky relationship between the two.
▶ The key to understanding the syntax of Scheme is to think about data

SLIDE 41

The Language of Data

What is a language of data? A language in which to

▶ Describe arbitrarily-complex data
  ▶ Possibly multi-dimensional, deeply nested
  ▶ Polymorphic
▶ Access components easily and efficiently

SLIDE 42

The Language of Data (continued)

Today many languages of data are known:

▶ S-expressions (the first: 1959)
▶ Functors (1972)
▶ Datalog (1977)
▶ SGML (1986)
▶ MS DDE (1987)
▶ CORBA (1991)
▶ MS COM (1993)
▶ MS DCOM (1996)
▶ XML (1996)
▶ JSON (2001)

SLIDE 43

The Language of Data (continued)

What makes S-expressions and Functors unique?

▶ They’re the first… 😊
▶ They’re supported natively, as part of specific programming languages
  ▶ S-expressions are supported by LISP-based languages, including Scheme & Racket
  ▶ Functors are supported by Prolog-based languages
▶ The language of programming is a [strict] subset of the language of data

SLIDE 44

The Language of Data (continued)

Think for a moment about the language of XML: <something>...</something>, etc

▶ It’s not supported natively by most programming languages
▶ Most modern languages (Java, Python, etc.) support it via libraries
▶ Almost no general-purpose programming language is written in XML:

    <package name="Foo">
      <class name="Foo">
        <method name="goo">
          ...
        </method>
      </class>
    </package>

This would be cumbersome, and weird!

SLIDE 45

The Language of Data (continued)

However, if some programming language both

▶ Supported XML as its data language
▶ Were itself written in XML

then a parser for XML could also read programs written in that language:

▶ Writing interpreters, compilers, and other language-tools would be much simpler!
▶ Reflection (code examining code) would be simple

SLIDE 46

The Language of Data (continued)

This is the case with S-expressions:

▶ They are the data language for LISP-based languages, including Scheme
▶ LISP-based languages are written using S-expressions
▶ Writing interpreters and compilers in LISP-based languages is much simpler than in other languages
▶ Computational reflection was invented in LISP!
▶ This is the real reason behind all these parentheses in Scheme:
  ▶ A very simple language
  ▶ Supports core types: pairs, vectors, symbols, strings, numbers, booleans, the empty list, etc.
  ▶ A syntactic compromise that is great for expressing both code and data

SLIDE 47

S-expressions (continued)

Back to S-expressions

▶ S-expressions were invented along with LISP, in 1959
▶ “S-expressions” stands for Symbolic Expressions
▶ The term is intended to distinguish them from numerical expressions
▶ Before LISP (and long after it was invented), most computation concerned itself with numbers
▶ Computer languages were great at “crunching numbers”, but working with non-numeric data types was difficult
  ▶ String libraries were non-standard and uncommon
  ▶ Polymorphic data was unheard of
  ▶ Nested data structures needed to be implemented from scratch, usually with arrays of characters and/or integers…

SLIDE 48

S-expressions (continued)

Back to S-expressions

Then S-expressions were invented as part of a very dynamic programming language (LISP):

▶ Working with data structures became considerably simpler
  ▶ Trivially allocated (no pointer arithmetic)
  ▶ Polymorphic (lists of lists of numbers and strings and vectors of booleans and…)
  ▶ Easy to access sub-structures (no pointer arithmetic)
  ▶ Easy to modify (in an easygoing, functional style)
  ▶ Easy to redefine
  ▶ Automatically deallocated (garbage collection)
▶ Treating code as data became considerably simpler

SLIDE 49

S-expressions (continued)

Several fields were invented using LISP and its tools:

▶ Symbolic Mathematics (Macsyma, a precursor to Wolfram Mathematica)
▶ Artificial Intelligence
▶ Computer adventure game generation languages (MDL, ZIL)

SLIDE 50

S-expressions (continued)

Definition: S-expressions

The language is made up of

▶ The empty list: ()
▶ Booleans: #f, #t
▶ Characters: #\a, #\Z, #\space, #\return, #\x05d0, etc.
▶ Strings: "abc", "Hello\nWorld\t\x05d0;hi!", etc.
▶ Numbers: -23, #x41, 2/3, 2-3i, 2.34, -2.34+3.5i
▶ Symbols: abc, lambda, define, fact, list->string
▶ Pairs: (a . b), (a b c), (a (2 . #f) "moshe")
▶ Vectors: #(), #(a b ((1 . 2) #f) "moshe")

Traditionally, non-pairs are known as atoms.

SLIDE 51

S-expressions (continued)

Proper & improper lists

▶ The name LISP comes from LISt Processing.
▶ In fact, LISP has no direct support for lists.
▶ LISP has ordered pairs
  ▶ Ordered pairs are created using cons
  ▶ The first and second projections over ordered pairs are car and cdr. For all x, y:
    ▶ (car (cons x y)) ≡ x
    ▶ (cdr (cons x y)) ≡ y
  ▶ The ordered pair of x and y can be written as (x . y)

SLIDE 52

S-expressions (continued)

The dot rules

Two rules govern how ordered pairs are printed:

▶ Rule 1: For any E, the ordered pair (E . ()) is printed as (E), which looks like a list of 1 element.
▶ Rule 2: For any E1, E2, …, the ordered pair (E1 . (E2 …)) is printed as (E1 E2 …)
▶ These rules only affect how pairs are printed
▶ These rules give us a canonical representation for pairs

SLIDE 53

S-expressions (continued)

Example

▶ The pair (a . (b . c)) is printed as (a b . c)

[Figure: box-and-pointer diagram of (a . (b . c)): a PAIR whose CAR is SYMBOL a and whose CDR is a PAIR with CAR SYMBOL b and CDR SYMBOL c]

SLIDE 54

S-expressions (continued)

Example

▶ The pair ((a . (b . ())) . ((c . (d . ())))) is printed as ((a b) (c d))

[Figure: box-and-pointer diagram of ((a b) (c d)), built from PAIR, SYMBOL, and NIL nodes]
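The dot rules and the two examples above can be reproduced with a small printer. This is a sketch in Python, under the assumptions that pairs are modelled by a `Pair` named tuple, the empty list by `None`, and symbols by Python strings; the function name `show` is hypothetical.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["car", "cdr"])
NIL = None  # stands in for the empty list ()


def show(e):
    """Render an S-expression the way the two dot rules prescribe."""
    if e is NIL:
        return "()"
    if not isinstance(e, Pair):
        return str(e)
    parts = []
    while isinstance(e, Pair):  # Rule 2: no dot while the cdr is itself a pair
        parts.append(show(e.car))
        e = e.cdr
    if e is not NIL:            # Rule 1 does not apply: keep the final dot
        parts.extend([".", show(e)])
    return "(" + " ".join(parts) + ")"


print(show(Pair("a", Pair("b", "c"))))
nested = Pair(Pair("a", Pair("b", NIL)), Pair(Pair("c", Pair("d", NIL)), NIL))
print(show(nested))
```

The first call prints `(a b . c)` and the second prints `((a b) (c d))`, matching the two slides.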

SLIDE 55

S-expressions (continued)

▶ Lists in Scheme can come in two forms, proper lists and improper lists.
▶ When we just speak of lists, we usually mean proper lists.
▶ Most of the list-processing functions (length, map, etc.) work with proper lists.

SLIDE 56

S-expressions (continued)

Proper lists

▶ Proper lists are nested ordered pairs, the rightmost cdr of which is the empty list (aka nil)
▶ Testing for pairs is cheap, and is done by means of the builtin predicate pair?
▶ Testing for lists is expensive, since it traverses nested, ordered pairs until it reaches their rightmost cdr. This is done by means of the builtin predicate list?

SLIDE 57

S-expressions (continued)

Proper lists

Here’s a definition for list?:

    (define list?
      (lambda (e)
        (or (null? e)
            (and (pair? e)
                 (list? (cdr e))))))

SLIDE 58

S-expressions (continued)

Improper lists

▶ Pairs that are not proper lists are improper lists.
▶ Improper lists end with a rightmost cdr that is not nil
▶ List-processing procedures such as length, map, etc., do not work over improper lists
▶ There is no builtin procedure for testing improper lists, but it could be written as follows:

    (define improper-list?
      (lambda (e)
        (and (pair? e)
             (not (list? (cdr e))))))
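For comparison, the two predicates can be mirrored in Python. This sketch assumes pairs are modelled by a `Pair` named tuple and the empty list by `None`; the names `is_list` and `is_improper_list` are hypothetical stand-ins for `list?` and `improper-list?`.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["car", "cdr"])
NIL = None  # stands in for the empty list ()


def is_list(e):
    """Mirror of list?: walk the cdr chain and check that it ends in NIL."""
    while isinstance(e, Pair):
        e = e.cdr
    return e is NIL


def is_improper_list(e):
    """Mirror of improper-list?: a pair whose cdr is not a proper list."""
    return isinstance(e, Pair) and not is_list(e.cdr)


proper = Pair("a", Pair("b", NIL))    # (a b)
improper = Pair("a", Pair("b", "c"))  # (a b . c)
print(is_list(proper), is_improper_list(proper))
print(is_list(improper), is_improper_list(improper))
```

The loop in `is_list` visits every pair along the cdr chain, which is why the test costs time proportional to the length of the list, whereas the pair test is a single check.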

SLIDE 59

S-expressions (continued)

Self-evaluating forms

Booleans, numbers, characters, strings are self-evaluating forms. You can evaluate them directly at the prompt:

    > 123
    123
    > "abc"
    "abc"
    > #t
    #t
    > #\m
    #\m

SLIDE 60

S-expressions (continued)

Other forms

The empty list, pairs, and vectors cannot be evaluated directly at the prompt:

▶ Entering an empty list or a vector or an improper list at the prompt generates an error.
▶ Entering a symbol at the prompt causes Scheme to attempt to evaluate a variable by the same name
▶ Entering a proper list, that is not the empty list, at the prompt causes Scheme to attempt to evaluate an application:

    > (a b c)
    Exception: variable b is not bound
    Type (debug) to enter the debugger.

SLIDE 61

S-expressions: quote & friends

To evaluate S-expressions that are not self-evaluating, we must use the form quote:

▶ The special form quote can be written in two ways:
  ▶ (quote <sexpr>)
  ▶ '<sexpr>
  Both forms are equivalent
▶ When you type abc at the Scheme prompt, you’re evaluating the variable abc
▶ When you type 'abc at the Scheme prompt, you’re evaluating the literal symbol abc
▶ The value of the literal symbol abc is just itself, which is why when you type 'abc at the Scheme prompt, you get back abc

SLIDE 62

S-expressions: quote & friends

▶ When you type () at the Scheme prompt, you’re evaluating an application with no function and no arguments! This is a syntax error!
▶ When you type '() at the Scheme prompt, you’re evaluating a literal empty list
▶ The value of the literal empty list is just itself, which is why when you type '() at the Scheme prompt, you get back ()

SLIDE 63

S-expressions: quote & friends

▶ When you type (a b c) at the Scheme prompt, you’re evaluating the application of the procedure a to the arguments b and c, which are variables
▶ When you type '(a b c) at the Scheme prompt, you’re evaluating the literal list (a b c)
▶ The value of the literal list (a b c) is just (a b c), which is why when you type '(a b c) at the Scheme prompt, you get back (a b c).
▶ Quoting a self-evaluating S-expression is possible, and redundant:

    > '2
    2
    > (+ '2 '3)
    5

SLIDE 64

S-expressions: quote & friends

So what does quote do?

▶ The quote form does nothing
  ▶ It is not a procedure
  ▶ It doesn’t take an argument
  ▶ It delimits a constant, literal S-expression
▶ The syntactic function of quote in Scheme is the same as the syntactic function of braces { ... } in C in defining literal data:

    const int A[] = {4, 9, 6, 3, 5, 1};

SLIDE 65

S-expressions: quote & friends

Meet quasiquote

▶ Similarly to quote, the form quasiquote can be written in two ways:
  ▶ (quasiquote <sexpr>)
  ▶ `<sexpr>
▶ quasiquote is also used to define data:
  ▶ `abc is the same as 'abc
  ▶ `(a b c) is the same as '(a b c)
▶ But quasiquote has two neat tricks!

SLIDE 66

S-expressions: quote & friends

Meet quasiquote

▶ The following two forms may occur within a quasiquote-expression:
  ▶ The unquote form:
    ▶ (unquote <sexpr>)
    ▶ ,<sexpr>
  ▶ The unquote-splicing form:
    ▶ (unquote-splicing <sexpr>)
    ▶ ,@<sexpr>
▶ Both unquote & unquote-splicing are used to mix dynamic data into the data defined with quasiquote

SLIDE 67

S-expressions: quote & friends

Meet quasiquote

    > '(a (+ 1 2 3) b)
    (a (+ 1 2 3) b)
    > '(a ,(+ 1 2 3) b)
    (a ,(+ 1 2 3) b)
    > `(a (+ 1 2 3) b)
    (a (+ 1 2 3) b)
    > `(a ,(+ 1 2 3) b)
    (a 6 b)
    > `(a ,(append '(x y) '(z w)) b)
    (a (x y z w) b)
    > `(a ,@(append '(x y) '(z w)) b)
    (a x y z w b)

SLIDE 68

S-expressions: quote & friends

Meet quasiquote

▶ The expression `(a ,(append '(x y) '(z w)) b) is equivalent to

    (cons 'a (cons (append '(x y) '(z w)) '(b)))

▶ The expression `(a ,@(append '(x y) '(z w)) b) is equivalent to

    (cons 'a (append (append '(x y) '(z w)) '(b)))

▶ The difference between unquote & unquote-splicing is that
  ▶ unquote mixes in an expression using cons
  ▶ unquote-splicing mixes in an expression using append
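Readers who know Python may find the following analogy useful. It is only an analogy: Python lists are not built from cons cells, and the variable names are made up for this example. Placing a list inside a list literal behaves like unquote, while star-unpacking behaves like unquote-splicing.

```python
xs = ["x", "y"] + ["z", "w"]  # plays the role of (append '(x y) '(z w))

unquoted = ["a", xs, "b"]     # like `(a ,(append '(x y) '(z w)) b)
spliced = ["a", *xs, "b"]     # like `(a ,@(append '(x y) '(z w)) b)

print(unquoted)  # the computed list appears as one nested element
print(spliced)   # the computed list's elements are merged into the outer list
```

In both cases the surrounding literal acts as the template, and only the marked position is computed.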

SLIDE 69

S-expressions: quote & friends

Meet quasiquote

▶ Together, quasiquote, unquote, & unquote-splicing are known as the quasiquote mechanism or the backquote mechanism
▶ The quasiquote mechanism allows us to create data by template, that is, by specifying the shape of the data
▶ In Scheme, convenient ways to create data translate immediately into convenient ways to create code
▶ Therefore we expect the quasiquote mechanism to have useful applications within programming languages
▶ We can turn code that computes something into code that shows us a computation…

SLIDE 70

S-expressions: quote & friends

Consider the familiar factorial function:

    (define fact
      (lambda (n)
        (if (zero? n)
            1
            (* n (fact (- n 1))))))

SLIDE 71

S-expressions: quote & friends

We use the quasiquote mechanism to convert the application (* n (fact (- n 1))) into code that describes what factorial does:

(define fact
  (lambda (n)
    (if (zero? n)
        1
        `(* ,n ,(fact (- n 1))))))

Running (fact 5) now gives:

> (fact 5)
(* 5 (* 4 (* 3 (* 2 (* 1 1)))))

As you can see, factorial now returns a trace of the computation, as data, rather than the number itself.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 71 / 175

slide-72
SLIDE 72

S-expressions: quote & friends

We are now going to use the quasiquote mechanism to get Scheme to teach us about the structure of S-expressions. Consider the following code:

(define foo
  (lambda (e)
    (cond ((pair? e)
           (cons (foo (car e)) (foo (cdr e))))
          ((or (null? e) (symbol? e)) e)
          (else e))))

What does this program do?

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 72 / 175

slide-73
SLIDE 73

S-expressions: quote & friends

Let’s call foo with some arguments:

> (foo 'a)
a
> (foo 123)
123
> (foo '())
()
> (foo '(a b c))
(a b c)

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 73 / 175

slide-74
SLIDE 74

S-expressions: quote & friends

Looking over the code again

(define foo
  (lambda (e)
    (cond ((pair? e)
           (cons (foo (car e)) (foo (cdr e))))
          ((or (null? e) (symbol? e)) e)
          (else e))))

we notice that:
▶ The 2nd and 3rd ribs of the cond overlap [we could have removed the 2nd]
▶ All atoms are left unchanged
▶ All pairs are duplicated, while recursing over the car and cdr of the pair

So foo does nothing, though it does it recursively! ☺

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 74 / 175

slide-75
SLIDE 75

S-expressions: quote & friends

We now use the quasiquote mechanism to cause foo to generate a trace:

(define foo
  (lambda (e)
    (cond ((pair? e)
           `(cons ,(foo (car e)) ,(foo (cdr e))))
          ((or (null? e) (symbol? e)) `',e)
          (else e))))

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 75 / 175

slide-76
SLIDE 76

S-expressions: quote & friends

Running foo now gives us some interesting data:

> (foo 'a)
'a
> (foo '(a b c))
(cons 'a (cons 'b (cons 'c '())))
> (foo '(a 1 b 2))
(cons 'a (cons 1 (cons 'b (cons 2 '()))))
> (foo 123)
123
> (foo '((a b) (c d)))
(cons (cons 'a (cons 'b '())) (cons (cons 'c (cons 'd '())) '()))

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 76 / 175

slide-77
SLIDE 77

S-expressions: quote & friends

▶ Using the quasiquote mechanism, we got foo to describe how S-expressions are created using the most basic API.
▶ We should really add support for proper lists and vectors!
▶ In fact, the name describe is far more appropriate than foo…

Let’s rewrite foo…

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 77 / 175

slide-78
SLIDE 78

S-expressions: quote & friends

(define describe
  (lambda (e)
    (cond ((list? e) `(list ,@(map describe e)))
          ((pair? e)
           `(cons ,(describe (car e)) ,(describe (cdr e))))
          ((vector? e)
           `(vector ,@(map describe (vector->list e))))
          ((or (null? e) (symbol? e)) `',e)
          (else e))))

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 78 / 175

slide-79
SLIDE 79

S-expressions: quote & friends

Running describe on various S-expressions is very instructive:

> (describe '(a b c))
(list 'a 'b 'c)
> (describe '#(a b c))
(vector 'a 'b 'c)
> (describe '(a b . c))
(cons 'a (cons 'b 'c))
> (describe ''a)
(list 'quote 'a)

Wait! What’s with the last example?!

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 79 / 175

slide-80
SLIDE 80

S-expressions: quote & friends

Recall what we said about quote, quasiquote, unquote, & unquote-splicing:

▶ (quote <sexpr>) ≡ '<sexpr>
▶ (quasiquote <sexpr>) ≡ `<sexpr>
▶ (unquote <sexpr>) ≡ ,<sexpr>
▶ (unquote-splicing <sexpr>) ≡ ,@<sexpr>

Now we get to see this happen…

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 80 / 175

slide-81
SLIDE 81

S-expressions: quote & friends

Now we get to see this happen:

> (describe ''<sexpr>)
(list 'quote '<sexpr>)
> (describe '`<sexpr>)
(list 'quasiquote '<sexpr>)
> (describe ',<sexpr>)
(list 'unquote '<sexpr>)
> (describe ',@<sexpr>)
(list 'unquote-splicing '<sexpr>)

Rule: Every Scheme expression used to be an S-expression when it was little!

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 81 / 175

slide-82
SLIDE 82

S-expressions: quote & friends

Question

What is (length '''''''''''''''''moshe) ?

👎 17
👎 16
👎 Generates an error message!
👎 1
👍 2

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 82 / 175

slide-83
SLIDE 83

S-expressions: quote & friends

Explanation

(length '''''''''''''''''moshe) is the same as (length '(quote <something>)), where <something> is '''''''''''''''moshe, but that really doesn’t matter! We are still computing the length of a list of size 2:

▶ The first element of the list is the symbol quote
▶ The second element of the list is '''''''''''''''moshe
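The explanation above can be checked against a small model of S-expressions. The following is only a sketch in OCaml: the type sexpr and the helpers quote, quote_n and length are hypothetical names introduced for illustration, and the count of 16 assumes that the outermost of the 17 quote marks is consumed by evaluation.

```ocaml
(* A hypothetical OCaml model of S-expressions; the names are
   illustrative and are not taken from the course's code. *)
type sexpr =
  | Nil
  | Symbol of string
  | Pair of sexpr * sexpr

(* 'e is read as the two-element proper list (quote e). *)
let quote e = Pair (Symbol "quote", Pair (e, Nil))

(* Apply quote n times to e. *)
let rec quote_n n e = if n = 0 then e else quote (quote_n (n - 1) e)

(* Length of a proper list; anything else is rejected. *)
let rec length = function
  | Nil -> 0
  | Pair (_, rest) -> 1 + length rest
  | Symbol _ -> invalid_arg "length: not a proper list"

let () =
  (* However deep the nesting, the outer list has two elements:
     the symbol quote and the (still quoted) rest. *)
  let datum = quote_n 16 (Symbol "moshe") in
  Printf.printf "%d\n" (length datum)
```

Under these assumptions the program prints 2, and it would print 2 for any positive nesting depth.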

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 83 / 175

slide-84
SLIDE 84

S-expressions: quote & friends (continued)

Question

The structure of the S-expression ''a in Scheme is:

👎 Just the symbol a
👎 The proper list (quote . (a . ()))
👎 The proper list (quote . (quote . (a . ())))
👎 An invalid S-expression
👍 The nested proper list (quote . ((quote . (a . ())) . ()))

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 84 / 175

slide-85
SLIDE 85

Further reading

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 85 / 175

slide-86
SLIDE 86

Chapter 2

Goals
🗹 The pipeline of the compiler
🗹 Introduction to syntactic analysis
☞ Further steps in ocaml

Agenda
☞ Ocaml
  ▶ Types
  ▶ References
  ▶ Modules & signatures
  ▶ Functional programming in ocaml

slide-87
SLIDE 87

Introduction to ocaml (2)

Still need to cover

To program in ocaml effectively in this course, we still need to learn some additional topics:

▶ Defining new data types
▶ Assignments, side-effects,

What we shan’t cover

Object Orientation: Once you’re comfortable with ocaml, you might like to pick up the object-oriented layer. As object-orientation goes, you should find it to be sophisticated and expressive.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 87 / 175

slide-88
SLIDE 88

Types

New types are defined using the type statement:

type fraction = {numerator : int; denominator : int};;

The above statement defines a new type fraction as a record consisting of two fields: numerator & denominator, both of type int.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 88 / 175

slide-89
SLIDE 89

Types (continued)

Once fraction has been defined, the underlying system recognizes it for all records with these fields & types:

# {numerator = 2; denominator = 3};;
- : fraction = {numerator = 2; denominator = 3}
# {denominator = 3; numerator = 2};;
- : fraction = {numerator = 2; denominator = 3}

Notice that the order of the fields in a record is immaterial, because the fields are accessed through their names, which are converted consistently into offsets.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 89 / 175

slide-90
SLIDE 90

Types (continued)

The type-inference engine in ocaml will correctly infer newly-defined types:

let add_fractions f1 f2 =
  match f1, f2 with
  | {numerator = n1; denominator = d1},
    {numerator = n2; denominator = d2} ->
     {numerator = n1 * d2 + n2 * d1;
      denominator = d1 * d2};;

And of course:

# add_fractions {numerator = 2; denominator = 3} {numerator = 4; denominator = 5};;
- : fraction = {numerator = 22; denominator = 15}

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 90 / 175

slide-91
SLIDE 91

Types (continued)

We can define disjoint types as follows:

type number =
  | Int of int
  | Frac of fraction
  | Float of float;;

Think of the | as disjunction. The initial | is optional in ocaml.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 91 / 175

slide-92
SLIDE 92

Types (continued)

We can now define a list of numbers as follows:

# [Int 3; Frac {numerator = 3; denominator = 4}; Float (4.0 *. atan(1.0))];;
- : number list =
[Int 3; Frac {numerator = 3; denominator = 4}; Float 3.14159265358979312]

Notice that ocaml had no trouble identifying each of the three elements of the list as belonging to type number.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 92 / 175

slide-93
SLIDE 93

Types (continued)

Working with disjoint types

Use match to dispatch over the corresponding type constructor, and make sure you handle each and every possibility!

let number_to_string x =
  match x with
  | Int n -> Format.sprintf "%d" n
  | Frac {numerator = num; denominator = den} ->
     Format.sprintf "%d/%d" num den
  | Float x -> Format.sprintf "%f" x;;

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 93 / 175

slide-94
SLIDE 94

Types (continued)

Working with disjoint types (continued)

And here’s how it looks:

# number_to_string (Int 234);;
- : string = "234"
# number_to_string (Frac {numerator = 2; denominator = 5});;
- : string = "2/5"
# number_to_string (Float 234.234);;
- : string = "234.234000"
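The same dispatch pattern works for any function over number. As a further sketch, the hypothetical function number_to_float below (not part of the slides) converts each alternative to a float, with one match case per constructor; the type definitions are repeated so that the example stands on its own.

```ocaml
type fraction = {numerator : int; denominator : int}

type number =
  | Int of int
  | Frac of fraction
  | Float of float

(* number_to_float is a hypothetical helper: one case per constructor. *)
let number_to_float x =
  match x with
  | Int n -> float_of_int n
  | Frac {numerator = num; denominator = den} ->
     float_of_int num /. float_of_int den
  | Float f -> f

let () =
  (* Sum a mixed list of numbers: 3 + 1/4 + 0.5 *)
  let total =
    List.fold_left (fun acc x -> acc +. number_to_float x) 0.0
      [Int 3; Frac {numerator = 1; denominator = 4}; Float 0.5] in
  Printf.printf "%g\n" total
```

If one of the three cases were omitted, the compiler would warn that the pattern matching is not exhaustive, which is a practical aid for handling "each and every possibility".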

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 94 / 175

slide-95
SLIDE 95

References

Let us take another look at the record-type. Recall the definition of fraction:

# type fraction = {numerator : int; denominator : int};;
type fraction = { numerator : int; denominator : int; }

In the function add_fractions we used pattern-matching to access the record-fields.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 95 / 175

slide-96
SLIDE 96

References (continued)

Ocaml lets you access fields directly, using the dot-notation that is familiar from object-oriented programming:

# {numerator = 3; denominator = 5}.numerator;;
- : int = 3
# {numerator = 3; denominator = 5}.denominator;;
- : int = 5

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 96 / 175

slide-97
SLIDE 97

References (continued)

Ocaml offers a special record-type known as a reference.

▶ References are derived types. For any type α, we can have a type α ref.
▶ References are records with a single field contents
▶ References have a special syntax ! to dereference the field:

# {contents = 1234};;
- : int ref = {contents = 1234}
# {contents = 1234}.contents;;
- : int = 1234
# ! {contents = 1234};;
- : int = 1234

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 97 / 175

slide-98
SLIDE 98

References (continued)

▶ References have a special syntax := for assignment
▶ This is how assignments are managed in ocaml

# let x = ref 1234;;
val x : int ref = {contents = 1234}
# x;;
- : int ref = {contents = 1234}
# !x;;
- : int = 1234
# x := 4567;;
- : unit = ()
# x;;
- : int ref = {contents = 4567}
# !x;;
- : int = 4567
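A typical use of ref, ! and := together is local state captured by a closure. The sketch below uses a hypothetical helper make_counter (not from the slides): each call allocates a fresh reference that only the returned function can reach.

```ocaml
(* make_counter is a hypothetical helper: every call creates its own
   private reference, updated with := and read with !. *)
let make_counter () =
  let count = ref 0 in
  fun () ->
    count := !count + 1;
    !count

let () =
  let next = make_counter () in
  let a = next () in
  let b = next () in
  let c = next () in
  Printf.printf "%d %d %d\n" a b c
```

The three calls return 1, 2 and 3; a second counter created with make_counter () would start again from 1, because it owns a different reference.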

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 98 / 175

slide-99
SLIDE 99

References (continued)

▶ It is not possible to perform assignments on variables
▶ It is only possible to change the fields of reference types

# let x = "abc";;
val x : string = "abc"
# x := "def";;
Characters 0-1:
  x := "def";;
  ^
Error: This expression has type string but an expression was expected of type
         'a ref

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 99 / 175

slide-100
SLIDE 100

References (continued)

▶ You can define a reference type of any other type, including other reference types:

# let x = ref (ref 1234);;
val x : int ref ref = {contents = {contents = 1234}}
# x := ref 5678;;
- : unit = ()
# x;;
- : int ref ref = {contents = {contents = 5678}}
# !x := 9876;;
- : unit = ()
# x;;
- : int ref ref = {contents = {contents = 9876}}

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 100 / 175

slide-101
SLIDE 101

Modules, signatures, functors

Modules

▶ A module is a way of packaging functions, classes, variables, &

types

▶ A signature is the type of a module

▶ Visibility of a module can be restricted through the signature

▶ Functors are functions from functors/modules to

functors/modules

Goals

▶ Learn to work with existing modules ▶ Learn to write your own modules

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 101 / 175

slide-102
SLIDE 102

Modules, signatures, functors (continued)

We define the function hyp to compute the third side of a triangle (the hypotenuse, when the angle is a right angle), given the other two sides and the angle between them (the law of cosines). We use the auxiliary function square:

# module M = struct
    let square x = x *. x
    let hyp a b theta =
      sqrt((square a) +. (square b) -. 2.0 *. a *. b *. (cos theta))
  end;;
module M :
  sig
    val square : float -> float
    val hyp : float -> float -> float -> float
  end

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 102 / 175

slide-103
SLIDE 103

Modules, signatures, functors (continued)

Both M.square and M.hyp are visible:

# M.hyp;;
- : float -> float -> float -> float = <fun>
# M.square;;
- : float -> float = <fun>
# M.square 2.0;;
- : float = 4.
# M.hyp 3.5 5.6 0.645771823239;;
- : float = 3.50763282088818817

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 103 / 175

slide-104
SLIDE 104

Modules, signatures, functors (continued)

We define the module type based on the returned signature of M, but with the square function removed:

# module type SigHyp = sig
    val hyp : float -> float -> float -> float
  end;;
module type SigHyp = sig val hyp : float -> float -> float -> float end
# module M : SigHyp = struct
    let square x = x *. x
    let hyp a b theta =
      sqrt((square a) +. (square b) -. 2.0 *. a *. b *. (cos theta))
  end;;
module M : SigHyp

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 104 / 175

slide-105
SLIDE 105

Modules, signatures, functors (continued)

Visibility is now restricted:

▶ M.hyp is visible from outside M
▶ M.square is not visible from outside M
▶ Functions visible from outside may use functions that are visible only from inside

# M.hyp;;
- : float -> float -> float -> float = <fun>
# M.square;;
Characters 0-8:
  M.square;;
  ^^^^^^^^
Error: Unbound value M.square
# M.hyp 3.5 5.6 0.645771823239;;
- : float = 3.50763282088818817

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 105 / 175

slide-106
SLIDE 106

Modules, signatures, functors (continued)

Summary

▶ Modules & signatures are the way to package functions & control visibility
  ▶ Convenient, super-efficient, safe
  ▶ No need to use local, nested functions to manage visibility
  ▶ Always use signatures to control visibility!

Learn on your own

▶ Modules can contain types too, and be used to parameterize code with types
  ▶ Simpler & better than generics & templates
▶ Functors map modules/functors ⇒ modules/functors
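As a starting point for the self-study items above, here is a minimal sketch of a functor that parameterizes code with a type. The names ORDERED, MakeMax, IntMax and StringMax are hypothetical and chosen only for this illustration.

```ocaml
(* ORDERED and MakeMax are hypothetical names chosen for this sketch. *)
module type ORDERED = sig
  type t
  val compare : t -> t -> int
end

(* A functor: a function from a module matching ORDERED to a new module. *)
module MakeMax (O : ORDERED) = struct
  let max a b = if O.compare a b >= 0 then a else b

  (* The largest element of a list, or None for the empty list. *)
  let max_list = function
    | [] -> None
    | x :: xs -> Some (List.fold_left max x xs)
end

(* Two instances of the same code, parameterized by different types. *)
module IntMax = MakeMax (struct
  type t = int
  let compare = compare
end)

module StringMax = MakeMax (struct
  type t = string
  let compare = String.compare
end)

let () =
  match IntMax.max_list [3; 9; 4], StringMax.max_list ["pear"; "apple"] with
  | Some n, Some s -> Printf.printf "%d %s\n" n s
  | _ -> ()
```

The body of MakeMax is written once against the signature ORDERED and knows nothing about the concrete type; each application of the functor produces a module specialized to the type supplied by its argument.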

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 106 / 175

slide-107
SLIDE 107

Further reading

▶ The Objective Caml Programming Language, Chapter 12
▶ An online tutorial on ocaml modules

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 107 / 175

slide-108
SLIDE 108

Parsing Techniques

Dozens of parsing algorithms are known:

▶ Parsing algorithms are tailored to a specific kind of grammar
  ▶ Different kinds of grammars can be parsed by different algorithms
  ▶ Different kinds of grammars have different levels of complexity on the Chomsky Hierarchy
▶ Most programming languages can be described using context-free grammars
  ▶ Some older languages can only be described using context-sensitive grammars

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 108 / 175

slide-109
SLIDE 109

Parsing Techniques (continued)

Context-free Grammars (CFGs)

A CFG is a structure of the form G = ⟨V, Σ, R, S⟩:

▶ V is a set of non-terminals
▶ Σ is a set of terminals, or tokens
▶ R is a relation in V × (V ∪ Σ)∗
  ▶ Members of R are called production rules or rewrite rules
▶ S is the initial non-terminal

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 109 / 175

slide-110
SLIDE 110

Parsing Techniques (continued)

Context-free Grammars (conveniences)

▶ We abbreviate the two productions ⟨A, X⟩, ⟨A, Y⟩ ∈ R with ⟨A, X | Y⟩ (disjunction)
▶ We abbreviate the three productions ⟨A, X⟩, ⟨X, ε⟩, ⟨X, BX⟩ ∈ R, where X has no other productions, with ⟨A, B∗⟩ (Kleene-star)
▶ We abbreviate the three productions ⟨A, X⟩, ⟨X, B⟩, ⟨X, BX⟩ ∈ R, where X has no other productions, with ⟨A, B+⟩ (Kleene-plus)
▶ We abbreviate the two productions ⟨A, ε⟩, ⟨A, B⟩ ∈ R with ⟨A, B?⟩ (maybe)

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 110 / 175

slide-111
SLIDE 111

Parsing Techniques (continued)

The two basic approaches to parsing CFG are top-down & bottom-up:

Top-down algorithms

▶ Start with the initial non-terminal
▶ Rewrite a non-terminal (the LHS of a production) with the corresponding RHS, matching the input stream of tokens

▶ Keep rewriting until the entire input stream is matched
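The top-down strategy above can be sketched as a hand-written recursive-descent recognizer. The grammar S → ( S ) S | ε and the names parse_s, accepts and Parse_error are assumptions made for this illustration; they do not come from the slides.

```ocaml
(* Hypothetical grammar, chosen for illustration:  S -> ( S ) S | epsilon *)
exception Parse_error

(* parse_s starts from the non-terminal S, picks a production by looking
   at the next token, and returns the tokens that remain unmatched. *)
let rec parse_s tokens =
  match tokens with
  | '(' :: rest ->
     (* Chosen production: S -> ( S ) S *)
     let rest = parse_s rest in
     (match rest with
      | ')' :: rest -> parse_s rest
      | _ -> raise Parse_error)
  | _ -> tokens  (* Chosen production: S -> epsilon *)

(* The input is accepted when S matches the entire token stream. *)
let accepts str =
  let tokens = List.init (String.length str) (String.get str) in
  match parse_s tokens with
  | [] -> true
  | _ -> false
  | exception Parse_error -> false

let () =
  List.iter
    (fun str -> Printf.printf "%S: %b\n" str (accepts str))
    ["(()())"; "(()"; ")("; ""]
```

Each recursive call corresponds to rewriting one occurrence of S, and the recognizer succeeds only when the rewriting has matched the whole input.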

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 111 / 175

slide-112
SLIDE 112

Parsing Techniques (continued)

The two basic approaches to parsing CFG are top-down & bottom-up:

Bottom-up algorithms

▶ Start with the input stream of tokens
▶ Find a rewrite rule where the RHS matches sequences in the input, and rewrite them to the LHS, reducing several items to a single non-terminal
▶ Keep rewriting until the entire input stream has been reduced to the initial non-terminal

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 112 / 175

slide-113
SLIDE 113

Parsing Techniques (continued)

How most parsing algorithms are used

▶ Describe the grammar of the language using a DSL for some restricted CFG
  ▶ Example: Backus-Naur Form (BNF)
▶ Associate actions with each production rule:
  ▶ How to build the AST when a specific rule is matched
▶ A parser generator (e.g., yacc, bison, antlr, etc) compiles the grammar:
  ▶ Performing various optimizations
  ▶ Generating code in some language (C, Java, ocaml, etc)
  ▶ This code is the parser
▶ Calling the parser on some input returns an AST

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 113 / 175

slide-114
SLIDE 114

Parsing Techniques (continued)

Goals of parsing algorithms

▶ Minimal restrictions on the grammar
▶ Avoid backtracking as much as possible
▶ Maximum optimizations of the parser

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 114 / 175

slide-115
SLIDE 115

Parsing Combinators

A technique for embedding a specification of a grammar into a programming language:

▶ Parsers for larger languages are composed from parsers for smaller languages
▶ The grammar can be written & debugged bottom-up
▶ The parsers are first-class objects:
  ▶ We get to use abstraction to create complex parsers quickly & simply
  ▶ Common sub-languages can be re-used effectively
▶ Simple to understand & implement
▶ Very rapid development

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 115 / 175

slide-116
SLIDE 116

Parsing Combinators (continued)

Parsing combinators do have some disadvantages:

▶ The grammar is embedded as-is:
  ▶ As much backtracking as implied by the grammar: Rewrite rules that have large common prefixes are going to require plenty of backtracking:
      A → xByCzDt
      A → xByCzDw
      · · ·
  ▶ No optimizations or transformations are performed on it!
▶ ε-productions & left-recursion result in infinite loops
  ▶ We need to eliminate these manually!
▶ Can produce inefficient parsers rather efficiently! 😊
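The cost of common prefixes can be made visible with a small experiment. The sketch below uses combinators of the kind defined later in this chapter (const, caten, disj); the counter prefix_calls, the helper char, and the parsers nt_x, nt_a and nt_a_factored are hypothetical names added for instrumentation, and the grammar A → xt | xw is a shortened stand-in for the rules above.

```ocaml
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s =
  try nt1 s with X_no_match -> nt2 s

let char c = const (fun ch -> ch = c)

(* Hypothetical instrumentation: count how often the shared prefix runs. *)
let prefix_calls = ref 0
let nt_x s =
  incr prefix_calls;
  char 'x' s

(* A -> x t | x w : both alternatives begin with x. *)
let nt_a = disj (caten nt_x (char 't')) (caten nt_x (char 'w'))

(* Left-factored by hand: A -> x (t | w). *)
let nt_a_factored = caten nt_x (disj (char 't') (char 'w'))

let () =
  let input = ['x'; 'w'] in
  prefix_calls := 0;
  ignore (nt_a input);
  let unfactored = !prefix_calls in
  prefix_calls := 0;
  ignore (nt_a_factored input);
  Printf.printf "unfactored: %d, factored: %d\n" unfactored !prefix_calls
```

On the input xw the unfactored parser runs the prefix parser twice (once per alternative), while the manually factored version runs it once. With long prefixes and many alternatives, this re-parsing is the backtracking the slide warns about.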

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 116 / 175

slide-117
SLIDE 117

Parsing Combinators (continued)

Nevertheless:

▶ Parsing combinators are a very simple way to learn about grammars:
  ▶ No complex algorithms are necessary!
▶ The easiest way to design complex grammars & their parsers. Abstraction:
  ▶ shortens & simplifies the code
  ▶ encourages re-use & consistency
▶ Optimizations can always be done manually!

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 117 / 175

slide-118
SLIDE 118

Parsing Combinators (continued)

Our parsing combinators take lists of characters for input, and return an AST. We start with code to convert strings to lists of characters:

let string_to_list str =
  let rec loop i limit =
    if i = limit then []
    else (String.get str i) :: (loop (i + 1) limit)
  in
  loop 0 (String.length str);;

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 118 / 175

slide-119
SLIDE 119

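Parsing Combinators (continued)

We shall also want to generate a string from a list of characters. Strings are immutable in current versions of ocaml, so the result is built as a mutable byte sequence and converted to a string at the end:

let list_to_string s =
  let rec loop s n =
    match s with
    | [] -> Bytes.make n '?'
    | car :: cdr ->
       let result = loop cdr (n + 1) in
       Bytes.set result n car;
       result
  in
  Bytes.to_string (loop s 0);;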

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 119 / 175

slide-120
SLIDE 120

Parsing Combinators (continued)

Sometimes our parsers must fail on their input. When this happens, we raise an exception (which in other languages is called throwing an exception). We should therefore define an exception:

exception X_no_match;;

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 120 / 175

slide-121
SLIDE 121

Parsing Combinators (continued)

Parsing combinators are compositional. This means

▶ We build parsers of large languages by combining parsers for smaller [sub-]languages
▶ The procedures that combine parsers are called parsing combinators (PCs)
▶ But we must start by being able to parse single characters
  ▶ All other parsers are built on top of such simple parsers for single characters

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 121 / 175

slide-122
SLIDE 122

Parsing Combinators (continued)

The const PC takes a predicate (char -> bool), and returns a parser that recognizes a single character satisfying that predicate:

let const pred =
  function
  | [] -> raise X_no_match
  | e :: s ->
     if (pred e) then (e, s)
     else raise X_no_match;;
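Because const is parameterized by a predicate, a few convenient single-character parsers follow directly from it. The helpers char, range and one_of below are hypothetical names used only in this sketch; they are not claimed to be part of the course's package.

```ocaml
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* Hypothetical helpers built on const; the names are not from the slides. *)

(* Match exactly the character c. *)
let char c = const (fun ch -> ch = c)

(* Match any character between lo and hi, inclusive. *)
let range lo hi = const (fun ch -> lo <= ch && ch <= hi)

(* Match any character that occurs in the string str. *)
let one_of str = const (fun ch -> String.contains str ch)

let () =
  let (e, rest) = range 'a' 'z' ['q'; '1'] in
  Printf.printf "%c %d\n" e (List.length rest)
```

Each helper returns a parser with the same shape as the result of const: it takes a list of characters and returns the matched character paired with the remaining input, or raises X_no_match.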

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 122 / 175

slide-123
SLIDE 123

Parsing Combinators (continued)

We define the non-terminal that recognizes the capital letter 'A' by calling const with a predicate that returns true if its argument is equal to 'A':

# let ntA = const (fun ch -> ch = 'A');;
val ntA : char list -> char * char list = <fun>

Notice that ntA

▶ …takes a list of characters
▶ …returns a pair of what it matched, and the remaining characters

This is the structure of all parsers written using PCs

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 123 / 175

slide-124
SLIDE 124

Parsing Combinators (continued)

Using ntA

# ntA ['A'; 'B'; 'C'];;
- : char * char list = ('A', ['B'; 'C'])
# ntA [];;
Exception: PC.X_no_match.
# ntA ['a'; 'A'];;
Exception: PC.X_no_match.

▶ We only match the head of the input
▶ Obviously, ntA fails on an empty list

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 124 / 175

slide-125
SLIDE 125

Parsing Combinators (continued)

▶ Testing our parsers by applying them to lists is no fun
  ▶ It’s a pain to type lists of characters!
▶ Let’s automate things a bit:

let test_string nt str =
  let (e, s) = (nt (string_to_list str)) in
  (e, (Printf.sprintf "->[%s]" (list_to_string s)));;

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 125 / 175

slide-126
SLIDE 126

Parsing Combinators (continued)

We can now test more easily:

# test_string ntA "";;
Exception: PC.X_no_match.
# test_string ntA "Abc";;
- : char * string = ('A', "->[bc]")

This is only for testing! When we deploy our parser, we’ll call it directly.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 126 / 175

slide-127
SLIDE 127

Parsing Combinators (continued)

Constant parsers are not very useful! Let’s consider catenation:

let caten nt1 nt2 =
  fun s ->
  let (e1, s) = (nt1 s) in
  let (e2, s) = (nt2 s) in
  ((e1, e2), s);;

▶ We try to parse the head of s using nt1
  ▶ If we succeed, we get e1 and the remaining chars s
▶ We try to parse the head of s (what remained after nt1) using nt2
  ▶ If we succeed, we get e2 and the remaining chars s
▶ We return the pair of e1 & e2, as well as the remaining chars

slide-128
SLIDE 128

Parsing Combinators (continued)

We define and test the parser for A followed by B:

# let ntAB = caten (const (fun ch -> ch = 'A'))
                   (const (fun ch -> ch = 'B'));;
val ntAB : char list -> (char * char) * char list = <fun>
# test_string ntAB "ABC";;
- : (char * char) * string = (('A', 'B'), "->[C]")
# test_string ntAB "abc";;
Exception: PC.X_no_match.
# test_string ntAB "Abc";;
Exception: PC.X_no_match.
# test_string ntAB "AB";;
- : (char * char) * string = (('A', 'B'), "->[]")
# test_string ntAB "A Bcdef";;
Exception: PC.X_no_match.

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 128 / 175

slide-129
SLIDE 129

Parsing Combinators (continued)

We now consider disjunction of two parsers:

let disj nt1 nt2 =
  fun s ->
  try (nt1 s)
  with X_no_match -> (nt2 s);;

▶ We try to parse the head of s using nt1
  ▶ If we succeed, then the call to nt1 returns normally
  ▶ If we fail, we try to parse the head of s using nt2

slide-130
SLIDE 130

Parsing Combinators (continued)

We define and test the parser for either A or a:

# let ntA_or_a = disj (const (fun ch -> ch = 'A'))
                      (const (fun ch -> ch = 'a'));;
val ntA_or_a : char list -> char * char list = <fun>
# test_string ntA_or_a "";;
Exception: PC.X_no_match.
# test_string ntA_or_a "this won't work either";;
Exception: PC.X_no_match.
# test_string ntA_or_a "A nice example";;
- : char * string = ('A', "->[ nice example]")
# test_string ntA_or_a "a nice example";;
- : char * string = ('a', "->[ nice example]")

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 130 / 175

slide-131
SLIDE 131

Parsing Combinators (continued)

What next?

▶ Some simple parsers
▶ Learn about the algebra of PCs
▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 131 / 175

slide-132
SLIDE 132

Some simple parsers

let nt_epsilon s = ([], s);;

let nt_none _ = raise X_no_match;;

let nt_end_of_input = function
  | [] -> ([], [])
  | _ -> raise X_no_match;;

▶ nt_epsilon is the parser that recognizes ε-productions
▶ nt_none is the parser that always fails
▶ nt_end_of_input is the parser that recognizes the end of the input stream (and fails otherwise)

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 132 / 175

slide-133
SLIDE 133

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
▶ Learn about the algebra of PCs
▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 133 / 175

slide-134
SLIDE 134

The Algebra of PCs

Why do nt_epsilon & nt_end_of_input match with the empty list []? This has to do with the Algebra of parsing combinators:

▶ What is the unit element of catenation?
  ▶ Answer: r = ε
  ▶ We’re looking for a non-terminal r such that for any s, we have rs = sr = s…
▶ This means that nt_epsilon is the unit element for caten:
  ▶ caten nt_epsilon nt ≡ caten nt nt_epsilon ≡ nt (they recognize the same language; the value returned by caten additionally pairs the match with [])
▶ Both nt_epsilon & nt_end_of_input are used ’til the end of something
  ▶ The natural operation is to create a list of all things until ε or the end-of-input are reached
  ▶ The unit element for append on lists is the empty list
  ▶ Ergo, it is natural to match [] when either condition is encountered

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 134 / 175

slide-135
SLIDE 135

The Algebra of PCs (continued)

Similarly, nt_none is the unit element in the algebra of disjunction:

disj nt nt_none ≡ disj nt_none nt ≡ nt

☞ Later on, we shall use the algebra of PCs together with folding operations to create complex parsers easily
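To make the role of the unit elements concrete, here is one possible way to fold disj and caten over lists of parsers, with nt_none and nt_epsilon as the initial values. This is a sketch under assumptions: disj_list, caten_list, char, nt_vowel and nt_let are hypothetical names, and the definitions need not coincide with those used in the course's package. The pack combinator is written as it appears elsewhere in these slides.

```ocaml
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s

let nt_epsilon s = ([], s)
let nt_none _ = raise X_no_match

(* Apply f to whatever nt matched. *)
let pack nt f s = let (e, s) = nt s in (f e, s)

(* Hypothetical: the disjunction of a list of parsers.
   nt_none is the unit, so the empty list yields a parser that fails. *)
let disj_list nts = List.fold_right disj nts nt_none

(* Hypothetical: the catenation of a list of parsers, collecting the
   matches in a list. nt_epsilon is the unit and contributes []. *)
let caten_list nts =
  List.fold_right
    (fun nt rest -> pack (caten nt rest) (fun (e, es) -> e :: es))
    nts
    nt_epsilon

let char c = const (fun ch -> ch = c)
let nt_vowel = disj_list (List.map char ['a'; 'e'; 'i'; 'o'; 'u'])
let nt_let = caten_list (List.map char ['l'; 'e'; 't'])

let () =
  let (word, rest) = nt_let ['l'; 'e'; 't'; ' '; 'x'] in
  Printf.printf "%d matched, %d remaining\n"
    (List.length word) (List.length rest)
```

Because nt_epsilon matches [], the fold in caten_list can cons each match onto the result of the rest without a special case for the last parser.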

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 135 / 175

slide-136
SLIDE 136

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
🗹 Learn about the algebra of PCs
▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 136 / 175

slide-137
SLIDE 137

New PC Operators

Identifying the characters, or pairs of characters, etc that match a grammar is often not enough:

▶ We want to be able to create an AST for that piece of syntax
▶ We do this by specifying postprocessing or callback functions over the expression that was matched.
▶ In our package, the PC that performs this is called pack

let pack nt f =
  fun s ->
  let (e, s) = (nt s) in
  ((f e), s);;

▶ pack takes a non-terminal nt and a function f
▶ returns a parser that recognizes the same language as nt
▶ …but which applies f to whatever was matched

slide-138
SLIDE 138

Parsing combinators (continued)

Example: Identifying digits

# let nt_digit_0_to_9 =
    const (fun ch -> '0' <= ch && ch <= '9');;
val nt_digit_0_to_9 : char list -> char * char list = <fun>
# test_string nt_digit_0_to_9 "234";;
- : char * string = ('2', "->[34]")
# let ascii_0 = int_of_char '0';;
val ascii_0 : int = 48
# let nt_digit_0_to_9 =
    pack (const (fun ch -> '0' <= ch && ch <= '9'))
         (fun ch -> (int_of_char ch) - ascii_0);;
val nt_digit_0_to_9 : char list -> int * char list = <fun>
# test_string nt_digit_0_to_9 "234";;
- : int * string = (2, "->[34]")

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 138 / 175

slide-139
SLIDE 139

Recursive productions

▶ Grammars are often recursive or mutually-recursive:
  ▶ The non-terminal on the LHS of a production often appears on the RHS (recursion)
  ▶ The non-terminal on the LHS of a production often appears in one of the RHSs of the transitive-reflexive closure of the relation (mutual recursion)
▶ Currently, we are unable to describe recursive rules using PCs

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 139 / 175

slide-140
SLIDE 140

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

⟨A⟩ → ( (⟨A⟩∗|ε) )

▶ The non-terminal A
▶ The open-parenthesis token
▶ The close-parenthesis token
▶ Never mind that we don’t yet have star…
▶ We can’t use nt_A before it’s defined!

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 140 / 175

slide-141
SLIDE 141

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

let nt_A =
  caten (const (fun ch -> ch = '('))
        (caten (disj (star nt_A) nt_epsilon)
               (const (fun ch -> ch = ')')));;

▶ Never mind that we don’t yet have star…
▶ We can’t use nt_A before it’s defined!

Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 141 / 175

slide-142
SLIDE 142

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

▶ The problem is not specific to parsing combinators.
  ▶ For example, you couldn’t define in Scheme:
      (define f (g (h f)))
    because you can’t use something before it’s defined! (Ok, in some languages you can!)
▶ So how are recursive definitions possible at all?
  ▶ When you define a recursive function you are not using the function before it’s defined
  ▶ You are using the address of the function before the function is defined
▶ Recursive functions are circular data structures:
  ▶ The language definition permits you to define these particular circular structures statically, rather than at run-time


slide-143
SLIDE 143

Parsing combinators (continued)

To implement recursive parsers, we need to delay the evaluation of the recursive non-terminal

▶ “Wrap it in a lambda…”

let delayed thunk = fun s -> thunk() s;;

▶ A thunk is a procedure that takes zero arguments
▶ Thunks are used to delay evaluation
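To see delayed at work on the smallest possible recursive rule, here is a minimal sketch for ⟨As⟩ → 'a' ⟨As⟩ | ε that counts the 'a's it consumes. The definitions of const, caten, disj, pack and nt_epsilon are assumptions written to mirror the combinators from the earlier slides; make_nt_as and nt_as are hypothetical names introduced for this illustration.

```ocaml
exception X_no_match

(* Assumed definitions, mirroring the combinators introduced earlier. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s

let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

let nt_epsilon s = ([], s)

(* The thunk is only called once the parser is applied to input. *)
let delayed thunk = fun s -> thunk () s

(* <As> -> 'a' <As> | epsilon, packed into the number of 'a's matched. *)
let rec make_nt_as () =
  disj
    (pack
       (caten (const (fun ch -> ch = 'a')) (delayed make_nt_as))
       (fun (_, n) -> n + 1))
    (pack nt_epsilon (fun _ -> 0))

let nt_as = make_nt_as ()
```

Calling make_nt_as () terminates because the recursive reference sits inside delayed: building the parser never calls make_nt_as again; only running the parser on input does.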


slide-144
SLIDE 144

Recursive productions (continued)

Example: Identifying digits (continued)

# let nt_natural =
    let rec make_nt_natural () =
      pack (caten nt_digit_0_to_9
                  (disj (delayed make_nt_natural)
                        nt_epsilon))
           (function (a, s) -> a :: s) in
    make_nt_natural();;
val nt_natural : char list -> int list * char list = <fun>

▶ Notice the packing function (function (a, s) -> a :: s)


slide-145
SLIDE 145

Recursive productions (continued)

Example: Identifying digits (continued)

# test_string nt_natural "1234";;
- : int list * string = ([1; 2; 3; 4], "->[]")

We are not done yet:

▶ We got a list of digits (ints, as opposed to chars), but not yet a single number!

☞ We want to left-fold these digits into a number in base 10
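The left fold itself can be tried in isolation. This is a small sketch; digits_to_int is a hypothetical helper name, not part of the course's PC library. Note that List.fold_left takes the initial accumulator (here 0) before the list.

```ocaml
(* Fold a list of decimal digits, most significant first, into an int. *)
let digits_to_int digits =
  List.fold_left (fun acc d -> 10 * acc + d) 0 digits

(* For [1; 2; 3; 4] the accumulator goes 0, 1, 12, 123, 1234. *)
let () = Printf.printf "%d\n" (digits_to_int [1; 2; 3; 4])
```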


slide-146
SLIDE 146

Parsing combinators (continued)

We pack the list of digits using a left-fold:

# let nt_natural =
    let rec make_nt_natural () =
      pack (caten nt_digit_0_to_9
                  (disj (delayed make_nt_natural)
                        nt_epsilon))
           (function (a, s) -> a :: s) in
    pack (make_nt_natural())
         (fun s -> (List.fold_left (fun a b -> 10 * a + b) 0 s));;
val nt_natural : char list -> int * char list = <fun>

▶ Notice the type of the parser: char list -> int * char list


slide-147
SLIDE 147

Recursive productions (continued)

Testing it:

# test_string nt_natural "1234";;
- : int * string = (1234, "->[]")


slide-148
SLIDE 148

Recursive productions (continued)

The parser ntParen expresses the grammar of one set of arbitrarily-nested parentheses:

# let rec ntParen s =
    pack (caten (const (fun ch -> ch = '('))
                (caten (disj (delayed (fun _ -> ntParen))
                             (pack nt_epsilon (fun _ -> "ntParen")))
                       (const (fun ch -> ch = ')'))))
         (fun _ -> "ntParen")
         s;;
val ntParen : char list -> string * char list = <fun>


slide-149
SLIDE 149

Recursive productions (continued)

Testing ntParen on various inputs:

# test_string ntParen "()";;
- : string * string = ("ntParen", "->[]")
# test_string ntParen "";;
Exception: PC.X_no_match.
# test_string ntParen "((()))";;
- : string * string = ("ntParen", "->[]")
# test_string ntParen "((())())";;
Exception: PC.X_no_match.
# test_string ntParen "((()))ABC";;
- : string * string = ("ntParen", "->[ABC]")


slide-150
SLIDE 150

Parsing combinators (continued)

▶ By now, our toolset of parsing combinators consists of

▶ const
▶ caten
▶ disj
▶ pack
▶ delayed

▶ We can handle recursive grammars
▶ We can create ASTs
▶ In principle, we can implement parsers for any language

☞ We now wish to add additional PCs to simplify the task of writing parsers
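As an illustration of "We can create ASTs", here is a sketch of the nested-parentheses parser that packs its result into a small tree instead of a constant string. The type nesting, the parser nt_paren, and the helper depth are hypothetical names for this example; the five combinators are assumed definitions mirroring the toolset listed above.

```ocaml
exception X_no_match

(* Assumed definitions of the five combinators in the toolset. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)
let nt_epsilon s = ([], s)
let delayed thunk = fun s -> thunk () s

let string_to_list str = List.init (String.length str) (String.get str)

(* A tiny AST: Leaf is the empty interior, Nest wraps one more pair. *)
type nesting = Leaf | Nest of nesting

(* <A> -> ( (<A> | epsilon) ) *)
let rec nt_paren s =
  pack
    (caten (const (fun ch -> ch = '('))
       (caten
          (disj (delayed (fun () -> nt_paren))
             (pack nt_epsilon (fun _ -> Leaf)))
          (const (fun ch -> ch = ')'))))
    (fun (_, (inner, _)) -> Nest inner)
    s

(* Walk the AST to compute how deeply the parentheses were nested. *)
let rec depth = function
  | Leaf -> 0
  | Nest inner -> 1 + depth inner
```

The only change relative to a recognizer is the packing function: it keeps the inner result and wraps it, so the shape of the input survives in the value.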


slide-151
SLIDE 151

New PC Operators (continued)

The Kleene Star

The Kleene-star is a meta-production-rule, or a rule-schema, or a “macro” over production-rules.

▶ For any NT P, P∗ stands for the rule Pstar defined as follows:

Pstar → P Pstar | ε

▶ The point of the Kleene-star is to recognize the catenation of zero or more expressions in P.

[Photo: Stephen Cole Kleene]


slide-152
SLIDE 152

New PC Operators (continued)

Here is our support for the Kleene-star:

let rec star nt =
  fun s ->
  try let (e, s) = (nt s) in
      let (es, s) = (star nt s) in
      (e :: es, s)
  with X_no_match -> ([], s);;

Notice how we match ε implicitly.
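For comparison, the rule Pstar → P Pstar | ε can also be transcribed literally with caten, disj and pack. The sketch below places such a transcription next to the direct definition; star_by_rule is a hypothetical name, and the basic combinators are assumed definitions mirroring the earlier slides.

```ocaml
exception X_no_match

(* Assumed definitions, mirroring the combinators introduced earlier. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)
let nt_epsilon s = ([], s)

(* The direct definition: epsilon is matched by the exception handler. *)
let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

(* A literal transcription of  Pstar -> P Pstar | epsilon.
   Taking s as an explicit parameter delays the recursive reference,
   so (star_by_rule nt) is only a partial application here. *)
let rec star_by_rule nt s =
  disj
    (pack (caten nt (star_by_rule nt)) (fun (e, es) -> e :: es))
    nt_epsilon
    s

let nt_star_char = const (fun ch -> ch = '*')
```

Both versions should agree on parsers that consume at least one character per match; neither terminates if nt can succeed without consuming input.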


slide-153
SLIDE 153

New PC Operators (continued)

The Kleene-plus

▶ For any NT P, P+ stands for the rule Pplus defined as follows:

Pplus → P Pplus | P

▶ The point of the Kleene-plus is to recognize the catenation of one or more expressions in P.

▶ Kleene didn’t really invent the Kleene-plus

▶ Rather, Kleene-plus is a natural extension of Kleene-star

slide-154
SLIDE 154

New PC Operators (continued)

Here is our support for the Kleene-plus:

let plus nt =
  pack (caten nt (star nt))
       (fun (e, es) -> (e :: es));;

Notice how we define the Kleene-plus as the catenation of the original NT and its Kleene-star.


slide-155
SLIDE 155

New PC Operators (continued)

Let’s test star and plus:

# let star_star = star (const (fun ch -> ch = '*'));;
val star_star : char list -> char list * char list = <fun>
# let star_plus = plus (const (fun ch -> ch = '*'));;
val star_plus : char list -> char list * char list = <fun>
# test_string star_star "****the end!";;
- : char list * string =
(['*'; '*'; '*'; '*'], "->[the end!]")
# test_string star_plus "****the end!";;
- : char list * string =
(['*'; '*'; '*'; '*'], "->[the end!]")
# test_string star_star "the end!";;
- : char list * string = ([], "->[the end!]")
# test_string star_plus "the end!";;
Exception: PC.X_no_match.


slide-156
SLIDE 156

New PC Operators (continued)

Ocaml provides the polymorphic type α option = None | Some of α as a way of dealing with situations where a value may or may not exist. We’re going to use α option to implement maybe, which takes a parser r, and returns a parser r? that recognizes zero or one occurrences of whatever is recognized by r.


slide-157
SLIDE 157

New PC Operators (continued)

let maybe nt =
  fun s ->
  try let (e, s) = (nt s) in
      (Some(e), s)
  with X_no_match -> (None, s);;


slide-158
SLIDE 158

New PC Operators (continued)

Assume you have the parser nt_integer, that recognizes integers. Here is how we might use maybe:

# test_string nt_integer "1234";;
- : int * string = (1234, "->[]")
# test_string (maybe nt_integer) "1234";;
- : int option * string = (Some 1234, "->[]")
# test_string (maybe nt_integer) "moshe";;
- : int option * string = (None, "->[moshe]")

You would use pattern matching (via match) to handle both cases (None/Some)
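One place where the None/Some match shows up naturally is an optional sign. The following sketch parses a natural number with an optional leading minus; nt_digit, nt_natural and nt_signed are hypothetical parsers written for this example (they are not the course's nt_integer), and the combinators are assumed definitions mirroring the slides.

```ocaml
exception X_no_match

(* Assumed definitions, mirroring the combinators introduced earlier. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let pack nt f s = let (e, s) = nt s in (f e, s)

let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let plus nt = pack (caten nt (star nt)) (fun (e, es) -> e :: es)

let maybe nt s =
  try
    let (e, s) = nt s in
    (Some e, s)
  with X_no_match -> (None, s)

let string_to_list str = List.init (String.length str) (String.get str)

(* One decimal digit, packed into its numeric value. *)
let nt_digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
    (fun ch -> Char.code ch - Char.code '0')

(* One or more digits, left-folded into a base-10 number. *)
let nt_natural =
  pack (plus nt_digit) (List.fold_left (fun a b -> 10 * a + b) 0)

(* An optional '-' followed by a natural number. *)
let nt_signed =
  pack (caten (maybe (const (fun ch -> ch = '-'))) nt_natural)
    (fun (sign, n) ->
       match sign with
       | Some _ -> -n   (* a minus sign was present *)
       | None -> n)     (* no sign: the number is non-negative *)
```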


slide-159
SLIDE 159

New PC Operators (continued)

We might want to attach an arbitrary predicate to serve as a guard for a parser, so that the parser succeeds only if the matched object satisfies the guard. This is what the guard PC does:

let guard nt pred =
  fun s ->
  let ((e, s) as result) = (nt s) in
  if (pred e) then result
  else raise X_no_match;;


slide-160
SLIDE 160

New PC Operators (continued)

Let’s use guard to identify only even numbers:

# test_string (guard nt_integer (fun n -> n land 1 = 0)) "12345";;
Exception: PC.X_no_match.
# test_string (guard nt_integer (fun n -> n land 1 = 0)) "123456";;
- : int * string = (123456, "->[]")

Because the guard may be an arbitrary predicate, this exceeds the expressive power of CFGs!
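Even numbers on their own could still be described by a grammar; the extra power comes from guards that compare parts of the match. A standard example is the language aⁿbⁿcⁿ, which is known not to be context-free. The sketch below recognizes it with star and guard; nt_abc is a hypothetical name, and the combinators are assumed definitions mirroring the slides.

```ocaml
exception X_no_match

(* Assumed definitions, mirroring the combinators introduced earlier. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let guard nt pred s =
  let ((e, _) as result) = nt s in
  if pred e then result else raise X_no_match

let char c = const (fun ch -> ch = c)

let string_to_list str = List.init (String.length str) (String.get str)

(* a* b* c* is an ordinary (even regular) pattern; the guard then insists
   that the three runs have equal length, which no CFG can express. *)
let nt_abc =
  guard
    (caten (star (char 'a')) (caten (star (char 'b')) (star (char 'c'))))
    (fun (a_run, (b_run, c_run)) ->
       List.length a_run = List.length b_run
       && List.length b_run = List.length c_run)
```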


slide-161
SLIDE 161

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
🗹 Learn about the algebra of PCs
🗹 Learn of new PC operators

▶ Learn how to use abstraction to make our life simpler


slide-162
SLIDE 162

Functional abstraction in PCs

We now wish to demonstrate some examples of using functional abstraction to write parsers in a general, consistent, and convenient way. Up to now, we defined single-character parsers using const:

let nt_A = const (fun ch -> ch = 'A');;

This is kind of clumsy. Let’s see how we can do this better!


slide-163
SLIDE 163

Functional abstraction in PCs (continued)

let make_char equal ch1 = const (fun ch2 -> equal ch1 ch2);;

let char = make_char (fun ch1 ch2 -> ch1 = ch2);;

let char_ci =
  make_char (fun ch1 ch2 ->
             (Char.lowercase_ascii ch1) =
               (Char.lowercase_ascii ch2));;

The use of make_char allows us to define parser-generating functions for characters, in a case-sensitive or case-insensitive way.

☞ Warning: The version of ocaml installed in the labs uses Char.lowercase, which is now deprecated. It’ll be upgraded [next year].


slide-164
SLIDE 164

Functional abstraction in PCs (continued)

# test_string (char 'a') "abc";;
- : char * string = ('a', "->[bc]")
# test_string (char 'a') "ABC";;
Exception: PC.X_no_match.
# test_string (char_ci 'a') "abc";;
- : char * string = ('a', "->[bc]")
# test_string (char_ci 'a') "ABC";;
- : char * string = ('A', "->[BC]")


slide-165
SLIDE 165

Functional abstraction in PCs (continued)

If we wish to recognize entire words, this is still very cumbersome. We can put the algebra of catenation to good use to do better: To identify a word, we:

▶ Take a string of chars, and convert it to a list
▶ Map over each character in the list, creating a parser that recognizes that character
▶ Perform a right fold over that list using the caten operation (with an appropriate pack)
▶ The unit element is the unit element of catenation, namely epsilon

By abstracting over char we can get both case-sensitive and case-insensitive variants!


slide-166
SLIDE 166

Functional abstraction in PCs (continued)

Here is the code:

let make_word char str =
  List.fold_right
    (fun nt1 nt2 -> pack (caten nt1 nt2) (fun (a, b) -> a :: b))
    (List.map char (string_to_list str))
    nt_epsilon;;

let word = make_word char;;

let word_ci = make_word char_ci;;


slide-167
SLIDE 167

Functional abstraction in PCs (continued)

# test_string (word "moshe") "moshe is a nice guy!";;
- : char list * string =
(['m'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")
# test_string (word_ci "moshe") "Moshe is a nice guy!";;
- : char list * string =
(['M'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")


slide-168
SLIDE 168

Functional abstraction in PCs (continued)

We might want to recognize any single character out of those in a given string. Rather than specifying long disjunctions, we can use one_of to do this for us.

▶ Very similar to word:

▶ We use disj rather than caten
▶ The unit element for disj is nt_none

Such is the power of abstraction!


slide-169
SLIDE 169

Functional abstraction in PCs (continued)

let make_one_of char str =
  List.fold_right
    disj
    (List.map char (string_to_list str))
    nt_none;;

let one_of = make_one_of char;;

let one_of_ci = make_one_of char_ci;;

As usual, we generate both the case-sensitive and case-insensitive versions!


slide-170
SLIDE 170

Functional abstraction in PCs (continued)

Let’s try out one_of:

# test_string (one_of "abcdef") "moshe!";;
Exception: PC.X_no_match.
# test_string (one_of "abcdef") "be moshe!";;
- : char * string = ('b', "->[e moshe!]")


slide-171
SLIDE 171

Functional abstraction in PCs (continued)

When we wanted to recognize a range of characters, we, once again, used the const PC. We can do better using abstraction:

let make_range leq ch1 ch2 (s : char list) =
  const (fun ch -> (leq ch1 ch) && (leq ch ch2)) s;;

let range = make_range (fun ch1 ch2 -> ch1 <= ch2);;

let range_ci =
  make_range (fun ch1 ch2 ->
              (Char.lowercase_ascii ch1) <=
                (Char.lowercase_ascii ch2));;


slide-172
SLIDE 172

Functional abstraction in PCs (continued)

And here is how we can test range:

# test_string (star (range 'a' 'z')) "hello world!";;
- : char list * string =
(['h'; 'e'; 'l'; 'l'; 'o'], "->[ world!]")
# test_string (star (range 'a' 'z')) "HELLO WORLD!";;
- : char list * string =
([], "->[HELLO WORLD!]")
# test_string (star (range_ci 'a' 'z')) "Hello World!";;
- : char list * string =
(['H'; 'e'; 'l'; 'l'; 'o'], "->[ World!]")
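The range abstractions compose with the other PCs. As a sketch, here is a parser for hexadecimal numerals built from range, range_ci, disj, plus and pack; nt_hex_digit and nt_hex are hypothetical names, and the combinators are assumed definitions mirroring the slides.

```ocaml
exception X_no_match

(* Assumed definitions, mirroring the combinators introduced earlier. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)

let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let plus nt = pack (caten nt (star nt)) (fun (e, es) -> e :: es)

let make_range leq ch1 ch2 (s : char list) =
  const (fun ch -> leq ch1 ch && leq ch ch2) s

let range = make_range (fun ch1 ch2 -> ch1 <= ch2)

let range_ci =
  make_range (fun ch1 ch2 ->
      Char.lowercase_ascii ch1 <= Char.lowercase_ascii ch2)

let string_to_list str = List.init (String.length str) (String.get str)

(* A hex digit is 0-9, or a-f in either case, packed into its value. *)
let nt_hex_digit =
  disj
    (pack (range '0' '9') (fun ch -> Char.code ch - Char.code '0'))
    (pack (range_ci 'a' 'f')
       (fun ch -> Char.code (Char.lowercase_ascii ch) - Char.code 'a' + 10))

(* One or more hex digits, left-folded in base 16. *)
let nt_hex =
  pack (plus nt_hex_digit) (List.fold_left (fun a d -> 16 * a + d) 0)
```

For example, nt_hex applied to the characters of "1aF;" yields 431 (that is, 0x1AF) and leaves ";" unconsumed.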


slide-173
SLIDE 173

Functional abstraction in PCs (continued)

How might you debug parsers written using PCs?

▶ The PC trace_pc is a wrapper (using the decorator pattern) that can be used to trace any parser

▶ The trace_pc PC takes a documentation string and a parser, and returns a tracing parser.

# test_string (trace_pc "The word \"hi\"" (word "hi")) "high";;
;;; The word "hi" matched the head of "high", and the remaining string is "gh"
- : char list * string = (['h'; 'i'], "->[gh]")
# test_string (trace_pc "The word \"hi\"" (word "hi")) "bye";;
;;; The word "hi" failed on "bye"
Exception: PC.X_no_match.
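The slides do not show how trace_pc is implemented. One way such a decorator might be written is sketched below; this is an assumption for illustration, and the actual trace_pc in the course's PC library may differ. The helper list_to_string and the parser nt_h are hypothetical names.

```ocaml
exception X_no_match

(* Assumed definition, mirroring the const combinator from the slides. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* Hypothetical helper: turn a char list back into a string for display. *)
let list_to_string chars =
  String.concat "" (List.map (String.make 1) chars)

(* Wrap a parser: report success or failure, then behave exactly as the
   wrapped parser would (same result, or the same exception re-raised). *)
let trace_pc desc nt s =
  try
    let ((_, rest) as result) = nt s in
    Printf.printf
      ";;; %s matched the head of \"%s\", and the remaining string is \"%s\"\n"
      desc (list_to_string s) (list_to_string rest);
    result
  with X_no_match ->
    Printf.printf ";;; %s failed on \"%s\"\n" desc (list_to_string s);
    raise X_no_match

let nt_h = const (fun ch -> ch = 'h')
```

Because the wrapper has the same type as the parser it wraps, a traced parser can be dropped anywhere an ordinary one is expected, including inside caten, disj or star.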


slide-174
SLIDE 174

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
🗹 Learn about the algebra of PCs
🗹 Learn of new PC operators
🗹 Learn how to use abstraction to make our life simpler


slide-175
SLIDE 175

Further reading

🔘 Parsing Combinators
