Compiler Construction
Mayer Goldberg \ Ben-Gurion University October 31, 2018
Chapter 2
Goals
▶ The pipeline of the compiler ▶ Introduction to syntactic analysis ▶ Further steps in ocaml
Agenda
▶ The pipeline
▶ Syntactic analysis ▶ Semantic analysis ▶ Code generation
▶ The compiler for the course ▶ The language of S-expressions ▶ More ocaml
▶ The interpreter as an evaluation function ▶ The compiler as a translator & optimizer ▶ We explored the relations between interpretation & compilation
▶ Understanding the syntax of the program
▶ What kinds of statements & expressions there are ▶ What are the various parts of these statements & expressions ▶ Are they syntactically correct
▶ Understanding the meaning of the program
▶ Do the operations make sense? ▶ What are their types? ▶ Are they used in accordance with their types? ▶ On what data is the program acting? ▶ What is returned?
▶ Once we understand the syntax and meaning of a sentence, we
▶ Syntactic analysis
▶ Scanning ▶ Parsing ▶ Reading ▶ Tag-Parsing
▶ Semantic analysis ▶ Code generation
[Figure: the compiler pipeline. chars → Scanner → tokens → Reader → sexprs → Tag-Parser → ASTs → Semantic Analyser → ASTs → Code Generator → asm / mach lang. The Reader and the Tag-Parser together form the Parser.]
▶ Function: What they do ▶ Dependencies: Which stages depend on which other ▶ Complexity: How difficult it is to perform a stage
▶ Understanding syntax is relatively straightforward (unlike in
▶ Understanding meaning is way harder than understanding syntax ▶ Meaning is built upon syntax (in natural languages, syntax &
▶ Code generation is relatively straightforward (template-based)
▶ We distinguish [at least] two levels of optimizations:
▶ High-level optimizations (closer to the source language) would
▶ Low-level optimizations (closer to assembly language) would go
▶ The test during run-time has been eliminated ▶ The code is shorter ▶ Possibly lead to further, cascading optimizations
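A minimal sketch of one transformation with these three properties: removing a conditional whose test is already known at compile time. The `expr` type and the `simplify` function are hypothetical illustrations, not the course's code.

```ocaml
(* A hypothetical, deliberately tiny AST. *)
type expr =
  | Const of bool
  | Var of string
  | If of expr * expr * expr

(* Replace an `if` whose test simplifies to a constant by the branch that
   would be taken. Simplifying the test first lets one elimination expose
   another (the "cascading" effect). *)
let rec simplify e =
  match e with
  | If (test, dit, dif) ->
    (match simplify test with
     | Const true -> simplify dit
     | Const false -> simplify dif
     | test' -> If (test', simplify dit, simplify dif))
  | Const _ | Var _ -> e
```

In this sketch, an `if` nested in the test position is simplified first, so removing the inner test can make the outer test constant as well.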
▶ Saved 1 cycle ▶ Made the code smaller ▶ If this code appears within a loop, gains shall be multiplied…
▶ Concrete syntax ▶ Abstract syntax ▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace
▶ It’s one-dimensional ▶ Lacking in structure
▶ No nesting ▶ No sub-expressions
▶ Difficult to work with
▶ Difficult to access parts ▶ Difficult to determine correctness
▶ Contains redundancies (spaces, comments, etc)
▶ A text file ▶ Characters typed at the prompt
Abstract syntax
▶ Multi-dimensional ▶ Conveys structure
▶ Nested ▶ Recursive (following the inductive definition of the grammar)
▶ Easier to work with than the concrete syntax
▶ Easier to access parts ▶ Easier to verify correctness ▶ Some syntactic correctness issues have already been decided
Abstract Syntax-Tree (AST)
▶ The AST is a tree ▶ No text ▶ No parentheses ▶ No spaces, tabs, newlines ▶ The structure is evident ▶ Easy to find
▶ Easy to determine
▶ Easier to analyze,
▶ Parsing: going from concrete syntax to abstract syntax ▶ Parser: the tool that performs parsing
▶ Lacks structure ▶ Prone to errors ▶ Hard to delimit
▶ Inefficient to work with ▶ Concrete Syntax can be
▶ Visual languages ▶ Structure/syntax editors
▶ Has structure ▶ Many kinds of errors are
▶ Sub-Expressions are readily
▶ Efficient to work with
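To make the contrast concrete, here is a small sketch of an AST as an OCaml data type. The type `expr` and its constructors are hypothetical and cover only integer arithmetic.

```ocaml
(* Hypothetical AST for arithmetic expressions. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* The concrete syntax "1 + 2 * 3" corresponds to this tree. Precedence is
   encoded by nesting, so no parentheses or whitespace are stored. *)
let example = Add (Num 1, Mul (Num 2, Num 3))

(* Sub-expressions are reached by pattern matching, not by searching text. *)
let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b
```

Evaluating `eval example` walks the tree directly; nothing has to be re-read or re-delimited.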
Token
▶ The smallest, meaningful, lexical unit in a language ▶ Described using regular expressions ▶ Identified using DFA (a very simple model of computation) ▶ Examples
▶ Numbers ▶ [Non-nested] Strings ▶ Names (variables, functions) ▶ Punctuation
▶ Cannot handle nesting of any kind:
▶ Parenthesized expressions ▶ Nested comments ▶ etc.
▶ Scanning: going from characters into tokens ▶ Scanner: the tool that performs scanning ▶ Scanner generator: the tool that takes definitions for tokens,
▶ Examples of scanner-generators: lex, flex
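As an illustration of what a scanner does, here is a small hand-written sketch (not generated by lex or flex). The `token` type, the helper predicates, and `scan` are hypothetical; the token set is limited to integers, alphabetic names, and parentheses.

```ocaml
type token =
  | Int of int
  | Name of string
  | LParen
  | RParen

let is_digit c = '0' <= c && c <= '9'
let is_alpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_space c = c = ' ' || c = '\t' || c = '\n'

(* Index just past the longest run of characters satisfying [pred],
   starting at position [i]. *)
let rec run_end pred s i =
  if i < String.length s && pred s.[i] then run_end pred s (i + 1) else i

(* Turn a string of characters into a list of tokens, skipping whitespace. *)
let scan s =
  let rec loop i acc =
    if i >= String.length s then List.rev acc
    else
      let c = s.[i] in
      if is_space c then loop (i + 1) acc
      else if c = '(' then loop (i + 1) (LParen :: acc)
      else if c = ')' then loop (i + 1) (RParen :: acc)
      else if is_digit c then
        let j = run_end is_digit s i in
        loop j (Int (int_of_string (String.sub s i (j - i))) :: acc)
      else if is_alpha c then
        let j = run_end is_alpha s i in
        loop j (Name (String.sub s i (j - i)) :: acc)
      else failwith "scan: unexpected character"
  in
  loop 0 []
```

Note that the scanner emits the parentheses as individual tokens; checking that they are balanced is nesting, which is left to the parser.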
Delimiter
▶ Delimiters are characters that separate tokens ▶ In most languages spaces, parentheses, commas, semicolons,
▶ Some tokens must be separated by delimiters
▶ Two consecutive numbers, two consecutive symbols, etc.
▶ Some tokens do not need to be separated by delimiters
▶ Two consecutive strings, an open parenthesis followed by a
▶ Delimiters are language-dependent
Whitespace
▶ Whitespace refers to characters that
▶ Have no graphical representation ▶ Occur between tokens ▶ Spaces within strings are not whitespaces… ▶ Serve no syntactic purpose other than as delimiters and for
▶ Whitespace is language-dependent
▶ Delimiters & whitespaces ▶ Parentheses, brackets, braces, and other grouping and nesting
▶ Grammars with which to express the syntax of the language
▶ There are different kinds of grammars (CFG, CSG, two-level,
▶ There are different languages for expressing syntax in a
▶ Algorithms for parsing programs as per kind of grammar ▶ Techniques (e.g., parsing combinators, DCGs)
▶ Going from characters to tokens ▶ Identifying & grouping characters into tokens for words,
▶ Parsing over tokens is more efficient than parsing over
▶ In LISP/Prolog, the parser is split into two components:
▶ The reader, or the parser for the data language ▶ The tag-parser, or the parser for the source code
▶ In LISP/Scheme/Racket/Clojure/etc, the abstract syntax for
▶ In Prolog, the abstract syntax for the data is the abstract syntax
▶ Prolog is the programming language with the most powerful
▶ In programming languages in which the syntax of code is not a
▶ In programming languages in which the syntax of code is part of
▶ The concrete syntax of data is a stream of characters ▶ The concrete language of code is the abstract syntax of the
▶ In Scheme, the language of data is called S-expressions (more on
▶ The tag-parser takes sexprs and returns [ASTs for] exprs ▶ Languages other than from the LISP & Prolog families do not
▶ In such languages, parsing goes directly from tokens to [ASTs
▶ Annotate the ASTs ▶ Compute addresses ▶ Annotate tail-calls ▶ Type-check code ▶ Perform optimizations
▶ Generate a stream of instructions in
▶ assembly language ▶ machine language ▶ Build executable ▶ some other target language…
▶ Perform low-level optimizations
▶ Written in ocaml ▶ Support a subset of Scheme + extensions ▶ Support two, simple optimizations ▶ Compile to x86/64 ▶ Run on linux
▶ Support for the full language of Scheme ▶ Support for garbage collection ▶ Self-compilation
▶ We’re going to learn about syntax by studying the syntax of
▶ After all, we’re writing a Scheme compiler… ▶ It’s relatively simple, compared to the syntax of C, Java,
▶ It comes with some interesting twists
▶ Scheme comes with two languages:
▶ A language for code ▶ A language for data
▶ The key to understanding the syntax of Scheme, is to think
▶ Describe arbitrarily-complex data
▶ Possibly multi-dimensional, deeply nested ▶ Polymorphic
▶ Access components easily and efficiently
▶ S-expressions (the first: 1959) ▶ Functors (1972) ▶ Datalog (1977) ▶ SGML (1986) ▶ MS DDE (1987) ▶ CORBA (1991) ▶ MS COM (1993) ▶ MS DCOM (1996) ▶ XML (1996) ▶ JSON (2001)
▶ They’re the first… 😊 ▶ They’re supported natively, as part of specific programming
▶ S-expressions are supported by LISP-based languages, including
▶ Functors are supported by Prolog-based languages
▶ The language of programming is a [strict] subset of the language
▶ It’s not supported natively by any programming language ▶ Most modern languages (Java, Python, etc) support it via
▶ No programming language is written in XML:
▶ Supported XML as its data language ▶ Were itself written in XML
▶ Writing interpreters, compilers, and other language-tools would
▶ Reflection (code examining code) would be simple
▶ They are the data language for LISP-based languages, including
▶ LISP-based languages are written using S-expressions ▶ Writing interpreters and compilers in LISP-based languages is
▶ Computational reflection was invented in LISP! ▶ This is the real reason behind all these parentheses in Scheme:
▶ A very simple language ▶ Supports core types: pairs, vectors, symbols, strings, numbers,
▶ A syntactic compromise that is great for expressing both code
▶ S-expressions were invented along with LISP, in 1959 ▶ S-expressions stand for Symbolic Expressions ▶ The term is intended to distinguish itself from numerical
▶ Before LISP (and long after it was invented), most computation
▶ Computer languages were great at "crunching numbers", but
▶ String libraries were non-standard and uncommon ▶ Polymorphic data was unheard of ▶ Nested data structures needed to be implemented from scratch,
▶ Working with data structures became considerably simpler
▶ Trivially allocated (no pointer arithmetic) ▶ Polymorphic (lists of lists of numbers and strings and vectors of
▶ Easy to access sub-structures (no pointer arithmetic) ▶ Easy to modify (in an easygoing, functional style) ▶ Easy to redefine ▶ Automatically deallocated (garbage collection)
▶ Treating code as data became considerably simpler
▶ Symbolic Mathematics (Macsyma, a precursor to Wolfram
▶ Artificial Intelligence ▶ Computer adventure game generation languages (MDL, ZIL)
▶ The empty list: () ▶ Booleans: #f, #t ▶ Characters: #\a, #\Z, #\space, #\return, #\x05d0, etc ▶ Strings: "abc", "Hello\nWorld\t\x05d0;hi!", etc ▶ Numbers: -23, #x41, 2/3, 2-3i, 2.34, -2.34+3.5i ▶ Symbols: abc, lambda, define, fact, list->string ▶ Pairs: (a . b), (a b c), (a (2 . #f) "moshe") ▶ Vectors: #(), #(a b ((1 . 2) #f) "moshe")
▶ The name LISP comes from LISt Processing. ▶ In fact, LISP has no direct support for lists. ▶ LISP has ordered pairs
▶ Ordered pairs are created using cons ▶ The first and second projections over ordered pairs are car and
▶ (car (cons x y)) ≡ x ▶ (cdr (cons x y)) ≡ y ▶ The ordered pair of x and y can be written as (x . y)
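Since the compiler for the course is written in ocaml, it may help to see how S-expressions and ordered pairs could be modelled there. This is a sketch under assumptions: the constructor names are hypothetical, and `Number of float` is a simplification that ignores the rational and complex numbers listed earlier.

```ocaml
(* A hypothetical OCaml model of S-expressions. *)
type sexpr =
  | Nil                       (* the empty list: () *)
  | Bool of bool
  | Char of char
  | String of string
  | Number of float           (* simplification: one numeric type only *)
  | Symbol of string
  | Pair of sexpr * sexpr
  | Vector of sexpr list

let cons x y = Pair (x, y)

(* The two projections; they are only defined on pairs. *)
let car = function
  | Pair (x, _) -> x
  | _ -> invalid_arg "car: not a pair"

let cdr = function
  | Pair (_, y) -> y
  | _ -> invalid_arg "cdr: not a pair"

(* (a (2 . #f) "moshe") written out as nested pairs. *)
let example =
  cons (Symbol "a")
    (cons (cons (Number 2.0) (Bool false))
       (cons (String "moshe") Nil))
```

With these definitions, `car (cons x y)` is `x` and `cdr (cons x y)` is `y` for any `x` and `y`, mirroring the two laws above.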
▶ Rule 1: For any E, the ordered pair (E . ()) is printed as (E),
▶ Rule 2: For any E1, E2, …, the ordered pair (E1 . (E2 … )) is
▶ These rules only affect how pairs are printed ▶ These rules give us a canonical representation for pairs
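The two printing rules can be sketched as a pair of mutually recursive functions. The reduced `sexpr` type and the names `string_of_sexpr` and `string_of_tail` are hypothetical; only symbols, pairs, and the empty list are handled.

```ocaml
type sexpr =
  | Nil
  | Symbol of string
  | Pair of sexpr * sexpr

let rec string_of_sexpr = function
  | Nil -> "()"
  | Symbol s -> s
  | Pair (a, d) -> "(" ^ string_of_sexpr a ^ string_of_tail d ^ ")"

(* Prints the cdr of a pair, without the closing parenthesis. *)
and string_of_tail = function
  | Nil -> ""                                    (* Rule 1: omit ". ()" *)
  | Pair (a, d) ->                               (* Rule 2: omit ". (" and ")" *)
    " " ^ string_of_sexpr a ^ string_of_tail d
  | atom -> " . " ^ string_of_sexpr atom         (* any other cdr keeps its dot *)
```

A cdr that is neither a pair nor the empty list is the only case in which the dot survives in the printed form.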
▶ The pair (a . (b . c)) is printed as (a b . c)
[Figure: box diagram of (a . (b . c)): two PAIR nodes, each with a CAR and a CDR, over the symbols a, b, c]
▶ The pair ((a . (b . ())) . ((c . (d . ())))) is
[Figure: box diagram of ((a . (b . ())) . ((c . (d . ())))): six PAIR nodes, each with a CAR and a CDR, over the symbols a, b, c, d and three NILs]
▶ Lists in Scheme can come in two forms, proper lists and
▶ When we just speak of lists, we usually mean proper lists. ▶ Most of the list processing functions (length, map, etc) all work
▶ Proper lists are nested ordered pairs the rightmost cdr of which
▶ Testing for pairs is cheap, and is done by means of the builtin
▶ Testing for lists is expensive, since it traverses nested, ordered
▶ Pairs that are not proper lists are improper lists. ▶ Improper lists end with a rightmost cdr that is not nil ▶ List-processing procedures such as length, map, etc., do not
▶ There is no builtin procedure for testing improper lists, but it
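The difference in cost between testing for pairs and testing for proper lists, and one way an improper-list test could be written, can be sketched in ocaml. The reduced `sexpr` type and the names `is_pair`, `is_list`, and `is_improper_list` are hypothetical stand-ins for the Scheme predicates.

```ocaml
type sexpr =
  | Nil
  | Symbol of string
  | Pair of sexpr * sexpr

(* Constant time: only the outermost constructor is inspected. *)
let is_pair = function
  | Pair _ -> true
  | _ -> false

(* Linear time: follows the chain of cdrs until it ends. *)
let rec is_list = function
  | Nil -> true
  | Pair (_, d) -> is_list d
  | _ -> false

(* An improper list is a pair whose rightmost cdr is not the empty list. *)
let is_improper_list s = is_pair s && not (is_list s)
```

Under this definition the empty list is a proper list but not a pair, so it is not an improper list either.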
▶ Entering an empty list or a vector or an improper list at the
▶ Entering a symbol at the prompt causes Scheme to attempt to
▶ Entering a proper list, that is not the empty list, at the prompt
▶ The special form quote can be written in two ways:
▶ (quote <sexpr>) ▶ '<sexpr>
▶ When you type abc at the Scheme prompt, you’re evaluating
▶ When you type 'abc at the Scheme prompt, you’re evaluating
▶ The value of the literal symbol abc is just itself, which is why
▶ When you type () at the Scheme prompt, you’re evaluating an
▶ When you type '() at the Scheme prompt, you’re evaluating a
▶ The value of the literal empty list is just itself, which is why
▶ When you type (a b c) at the Scheme prompt, you’re
▶ When you type '(a b c) at the Scheme prompt, you’re
▶ The value of the literal list (a b c) is just (a b c), which is
▶ Quoting a self-evaluating S-expression is possible, and
▶ The quote form does nothing
▶ It is not a procedure ▶ It doesn’t take an argument ▶ It delimits a constant, literal S-expression
▶ The syntactic function of quote in Scheme is the same as the
▶ Similarly to quote, the form quasiquote can be written in two
▶ (quasiquote <sexpr>) ▶ `<sexpr>
▶ quasiquote is also used to define data:
▶ `abc is the same as 'abc ▶ `(a b c) is the same as '(a b c)
▶ But quasiquote has two neat tricks!
▶ The following two forms may occur within a
▶ The unquote form: ▶ (unquote <sexpr>) ▶ ,<sexpr> ▶ The unquote-splicing form: ▶ (unquote-splicing <sexpr>) ▶ ,@<sexpr>
▶ Both unquote & unquote-splicing are used to mix in
▶ The expression `(a ,(append '(x y) '(z w)) b) is
▶ The expression `(a ,@(append '(x y) '(z w)) b) is
▶ The difference between unquote & unquote-splicing is that
▶ unquote mixes in an expression using cons ▶ unquote-splicing mixes in an expression using append
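The cons-versus-append distinction can be modelled in ocaml on a reduced S-expression type. This is only an illustration of the resulting data, not an implementation of quasiquote; the helpers `of_list`, `append`, and `sym` are hypothetical.

```ocaml
type sexpr =
  | Nil
  | Symbol of string
  | Pair of sexpr * sexpr

let sym s = Symbol s

(* Build a proper list out of an OCaml list. *)
let rec of_list = function
  | [] -> Nil
  | x :: rest -> Pair (x, of_list rest)

(* append on proper lists represented as nested pairs *)
let rec append a b =
  match a with
  | Nil -> b
  | Pair (x, rest) -> Pair (x, append rest b)
  | _ -> invalid_arg "append: not a proper list"

(* The value of (append '(x y) '(z w)) *)
let v = append (of_list [sym "x"; sym "y"]) (of_list [sym "z"; sym "w"])

(* `(a ,(append ...) b): the value becomes ONE element, attached with cons *)
let unquoted = Pair (sym "a", Pair (v, of_list [sym "b"]))

(* `(a ,@(append ...) b): the elements of the value are spliced in with append *)
let spliced = Pair (sym "a", append v (of_list [sym "b"]))
```

Here `unquoted` models (a (x y z w) b), a list of three elements, while `spliced` models (a x y z w b), a list of six.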
▶ Together, quasiquote, unquote, & unquote-splicing are
▶ The quasiquote mechanism allows us to create data by
▶ In Scheme, convenient ways to create data translate
▶ Therefore we expect the quasiquote mechanism to have useful
▶ We can turn code that computes something into code that
▶ The 2nd and 3rd ribs of the cond overlap [we could have
▶ All atoms are left unchanged ▶ All pairs are duplicated, while recursing over the car and cdr of
▶ Using the quasiquote mechanism, we got foo to describe how
▶ We should really add support for proper lists and vectors! ▶ In fact, the name describe is far more appropriate than foo…
▶ (quote <sexpr>) ≡ '<sexpr> ▶ (quasiquote <sexpr>) ≡ `<sexpr> ▶ (unquote <sexpr>) ≡ ,<sexpr> ▶ (unquote-splicing <sexpr>) ≡ ,@<sexpr>
▶ The first element of the list is the symbol quote ▶ The second element of the list is '''''''''''''''moshe
▶ Types ▶ References ▶ Modules & signatures ▶ Functional programming in ocaml
▶ Defining new data types ▶ Assignments, side-effects,
▶ References are derived types. For any type α, we can have a
▶ References are records with a single field contents ▶ References have a special syntax ! to dereference the field:
▶ References have a special syntax := for assignment ▶ This is how assignments are managed in ocaml
▶ It is not possible to perform assignments on variables ▶ It is only possible to change the fields of reference types
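A short sketch of references in use. The names `counter` and `next` are hypothetical; `ref`, `!`, `:=`, and the `contents` field are standard ocaml.

```ocaml
(* counter : int ref, a record with a single mutable field, contents *)
let counter = ref 0

let next () =
  counter := !counter + 1;     (* := assigns to the contents field *)
  !counter                     (* ! reads the contents field *)

let a = next ()                (* the first call returns 1 *)
let b = next ()                (* the second call returns 2 *)

(* The same cell, accessed as the record it is. *)
let c = counter.contents
```

The variable `counter` itself is never re-bound; only the field of the reference it names changes.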
▶ You can define a reference type of any other type, including
▶ A module is a way of packaging functions, classes, variables, &
▶ A signature is the type of a module
▶ Visibility of a module can be restricted through the signature
▶ Functors are functions from functors/modules to
▶ Learn to work with existing modules ▶ Learn to write your own modules
▶ M.hyp is visible from outside M ▶ M.square is not visible from outside M ▶ Functions visible from outside may use functions visible from
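A sketch of a module M with the visibility described above. The bodies of `hyp` and `square` are assumptions (a hypotenuse computed from two sides); only the names and their visibility come from the slide.

```ocaml
module M : sig
  (* Only hyp is listed in the signature, so only hyp is exported. *)
  val hyp : float -> float -> float
end = struct
  (* square is hidden by the signature, but hyp may still use it. *)
  let square x = x *. x
  let hyp a b = sqrt (square a +. square b)
end

let five = M.hyp 3.0 4.0
```

Referring to `M.square` from outside the module is rejected by the type checker as an unbound value.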
▶ Modules & signatures are the way to package functions &
▶ Convenient, super-efficient, safe ▶ No need to use local, nested functions to manage visibility ▶ Always use signatures to control visibility!
▶ Modules can contain types too, and be used to parameterize
▶ Simpler & better than generics & templates
▶ Functors map modules/functors =
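A minimal sketch of a functor that parameterizes code by a module containing a type. The names `ORDERED`, `MakeMax`, and `IntOrd` are hypothetical; `String` is the standard-library module, which provides `t` and `compare`.

```ocaml
(* The signature the functor expects from its argument. *)
module type ORDERED = sig
  type t
  val compare : t -> t -> int
end

(* A functor: a function from modules to modules. *)
module MakeMax (O : ORDERED) = struct
  let max (a : O.t) (b : O.t) = if O.compare a b >= 0 then a else b
end

module IntOrd = struct
  type t = int
  let compare (a : int) (b : int) = compare a b
end

module IntMax = MakeMax (IntOrd)
module StringMax = MakeMax (String)
```

The same `max` code is instantiated once for integers and once for strings, each with its own ordering.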
▶ Parsing algorithms are tailored to a specific kind of grammar
▶ Different kinds of grammars can be parsed by different
▶ Different kinds of grammars have different levels of complexity
▶ Most programming languages can be described using
▶ Some older languages can only be described using
▶ V is a set of non-terminals ▶ Σ is a set of terminals, or tokens ▶ R is a relation in V × (V ∪ Σ)∗
▶ Members of R are called production rules or rewrite rules
▶ S is the initial non-terminal
▶ We abbreviate the two productions ⟨A, X⟩ , ⟨A, Y⟩ ∈ R with
▶ We abbreviate the three productions ⟨A, X⟩ , ⟨X, ε⟩ , ⟨X, BX⟩ ∈ R,
▶ We abbreviate the three productions
▶ We abbreviate the two productions ⟨A, ε⟩ , ⟨A, B⟩ ∈ R, with
▶ Start with the initial non-terminal ▶ Rewrite the LHS of a non-terminal with its RHS, matching the
▶ Keep rewriting until the entire input stream is matched
▶ Start with the input stream of tokens ▶ Find a rewrite rule where the RHS matches sequences in the
▶ Keep rewriting until the entire input stream has been reduced to
▶ Describe the grammar of the language using a DSL for some
▶ Example: Backus-Naur Form (BNF)
▶ Associate actions with each production rule:
▶ How to build the AST when a specific rule is matched
▶ A parser generator (e.g., yacc, bison, antlr, etc) compiles the
▶ Performing various optimizations ▶ Generating code in some language (C, Java, ocaml, etc) ▶ This code is the parser
▶ Calling the parser on some input returns an AST
▶ Minimal restrictions on the grammar ▶ Avoid backtracking as much as possible ▶ Maximum optimizations of the parser
▶ Parsers for larger languages are composed from parsers for
▶ The grammar can be written & debugged bottom-up ▶ The parsers are first-class objects:
▶ We get to use abstraction to create complex parsers quickly &
▶ Re-use common sub-languages effectively
▶ Simple to understand & implement ▶ Very rapid development
▶ The grammar is embedded as-is:
▶ As much backtracking as implied by the grammar: Rewrite
▶ No optimizations or transformations are performed on it!
▶ ε-productions & left-recursion result in infinite loops
▶ We need to eliminate these manually!
▶ Can produce inefficient parsers rather efficiently! 😊
▶ Parsing combinators are a very simple way to learn about grammars:
▶ No complex algorithms are necessary! ▶ The easiest way to design complex grammars & their parsers:
▶ shortens & simplifies the code ▶ encourages re-use & consistency
▶ Optimizations can always be done manually!
▶ We build parsers of large languages by combining parsers for
▶ The procedures that combine parsers are called parsing
▶ But we must start by being able to parse single characters
▶ All other parsers are built on top of such simple parsers for
▶ …takes a list of characters ▶ …returns a pair of what it matched, and the remaining characters
▶ We only match the head of the input ▶ Obviously, ntA fails on an empty list
▶ Testing our parsers by applying them to lists is no fun
▶ It’s a pain to type lists of characters!
▶ Let’s automate things a bit:
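A sketch of a single-character parser together with helpers for testing parsers on strings. It assumes that ntA recognizes the character 'A' and that failure is signalled by an exception named `X_no_match`; the helper names `string_to_list`, `list_to_string`, and `test_string` are also assumptions.

```ocaml
(* Assumed failure signal for all parsers in this sketch. *)
exception X_no_match

(* A parser takes a list of characters and returns the pair
   (what it matched, the remaining characters). *)
let ntA = function
  | 'A' :: rest -> ('A', rest)
  | _ -> raise X_no_match          (* wrong head, or empty input *)

let string_to_list s = List.init (String.length s) (String.get s)

let list_to_string cs = String.concat "" (List.map (String.make 1) cs)

(* Apply a parser to a string, and show the remaining input as a string. *)
let test_string nt str =
  let (e, rest) = nt (string_to_list str) in
  (e, list_to_string rest)
```

With these helpers, `test_string ntA "ABC"` can be typed instead of spelling out the list `['A'; 'B'; 'C']`.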
▶ We try to parse the head of s using nt1
▶ If we succeed, we get e1 and the remaining chars s ▶ We try to parse the head of s (what remained after nt1) using
▶ If we succeed, we get e2 and the remaining chars s ▶ We return the pair of e1 & e2, as well as the remaining chars
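The steps above can be sketched as a catenation combinator. The name `caten` appears later in these slides; the exception `X_no_match` and the helper `char` are assumptions made to keep the example self-contained.

```ocaml
exception X_no_match

(* Hypothetical helper: a parser for one specific character. *)
let char c = function
  | x :: rest when x = c -> (x, rest)
  | _ -> raise X_no_match

(* Parse with nt1, then parse what remains with nt2. Each binding of s
   shadows the previous one, so s always names the remaining input.
   If either parser fails, its exception propagates to the caller. *)
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let nt_ab = caten (char 'a') (char 'b')
```

Applied to the characters of "abc", `nt_ab` matches the pair ('a', 'b') and leaves ['c'].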
▶ We try to parse the head of s using nt1
▶ If we succeed, then the call to nt1 returns normally ▶ If we fail we try to parse the head of s using nt2
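The steps above can be sketched as a disjunction combinator. The name `disj`, the exception `X_no_match`, and the helper `char` are assumptions, not names given on the slide.

```ocaml
exception X_no_match

(* Hypothetical helper: a parser for one specific character. *)
let char c = function
  | x :: rest when x = c -> (x, rest)
  | _ -> raise X_no_match

(* Try nt1; only if it fails, try nt2 on the SAME input s. *)
let disj nt1 nt2 s =
  try nt1 s
  with X_no_match -> nt2 s

let nt_a_or_b = disj (char 'a') (char 'b')
```

Because the input is an immutable list, retrying with nt2 needs no explicit rewinding: the original `s` is still available.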
▶ Some simple parsers ▶ Learn about the algebra of PCs ▶ Learn of new PC operators ▶ Learn how to use abstraction to make our life simpler
▶ nt_epsilon is the parser that recognizes ε-productions ▶ nt_none is the parser that always fails ▶ nt_end_of_input is the parser that recognizes the end of the
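A sketch of how these three parsers could be written. The parser names come from the slide; the exception `X_no_match` is an assumption, and matching `[]` on success is one possible choice of result.

```ocaml
exception X_no_match

(* Recognizes the empty string: consumes nothing and always succeeds. *)
let nt_epsilon s = ([], s)

(* Always fails, whatever the input. *)
let nt_none _ = raise X_no_match

(* Succeeds only when no input remains. *)
let nt_end_of_input = function
  | [] -> ([], [])
  | _ -> raise X_no_match
```

Note the difference between the first and the last: `nt_epsilon` succeeds on any input and leaves it untouched, while `nt_end_of_input` succeeds only on the empty list.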
Learn about the algebra of PCs
▶ What is the unit element of catenation?
▶ Answer: r = ε ▶ We’re looking for a non-terminal r such that for any s, we have
▶ This means that nt_epsilon is the unit element for caten: ▶ caten nt_epsilon nt ≡ caten nt nt_epsilon ≡ nt ▶ Both nt_epsilon & nt_end_of_input are used ’til the end of
▶ The natural operation is to create a list of all things until ε or
▶ The unit element for append on lists is the empty list ▶ Ergo, it is natural to match [] when either condition is
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 134 / 175
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 135 / 175
▶ Learn of new PC operators ▶ Learn how to use abstraction to make our life simpler
▶ We want to be able to create an AST for that piece of syntax ▶ We do this by specifying postprocessing or callback functions
▶ In our package, the PC that performs this is called pack
▶ pack takes a non-terminal nt and a function f ▶ returns a parser that recognizes the same language as nt ▶ …but which applies f to whatever was matched
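A minimal sketch of pack, assuming the same parser representation as in the rest of this section (X_no_match is an assumed exception name). The parser nt_digit is a hypothetical example that turns a matched digit character into its integer value.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* pack nt f: recognize exactly what nt recognizes,
   but return f applied to the value nt produced. *)
let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

(* Hypothetical example: a digit char packed into an int. *)
let nt_digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')

let result = nt_digit ['7'; 'x']  (* (7, ['x']) *)
```

The callback runs only on success, so pack never changes which inputs are accepted.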
▶ Grammars are often recursive or mutually-recursive:
▶ The non-terminal on the LHS of a production often appears on
▶ The non-terminal on the LHS of a production often appears in
▶ Currently, we are unable to describe recursive rules using PCs
▶ The non-terminal A ▶ The open-parenthesis token ▶ The close-parenthesis token ▶ Never mind that we don’t yet have star… ▶ We can’t use nt_A before it’s defined!
▶ The problem is not specific to parsing combinators.
▶ For example, you couldn’t define in Scheme:
▶ So how are recursive definitions possible at all?
▶ When you define a recursive function you are not using the
▶ You are using the address of the function before the function is
▶ Recursive functions are circular data structures:
▶ The language definition permits you to define these particular
▶ ”Wrap it in a lambda…”
▶ A thunk is a procedure that takes zero arguments ▶ Thunks are used to delay evaluation
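The thunk idea can be sketched as a combinator, called delayed here to match the toolset listed later in the slides. It takes a thunk that returns a parser and calls it only when input arrives. The grammar A → ( A ) | ε, the helper names, and the choice of returning the nesting depth are all hypothetical, chosen only to show a recursive rule; X_no_match is an assumed exception name.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)
let nt_epsilon s = ([], s)

(* delayed thunk: the thunk is called only once a string s is supplied,
   so the parser it returns may be one that is still being defined. *)
let delayed thunk s = thunk () s

let nt_open = const (fun ch -> ch = '(')
let nt_close = const (fun ch -> ch = ')')

(* Hypothetical grammar: A -> ( A ) | epsilon. The value is the depth. *)
let nt_nested =
  let rec make_nt_nested () =
    disj
      (pack (caten nt_open (caten (delayed make_nt_nested) nt_close))
         (fun (_, (depth, _)) -> depth + 1))
      (pack nt_epsilon (fun _ -> 0))
  in
  make_nt_nested ()

let result = nt_nested ['('; '('; ')'; ')']  (* (2, []) *)
```

Building the parser terminates because delayed make_nt_nested is only a closure; the recursive call happens while parsing, one level per open parenthesis.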
▶ Notice the packing function (function (a, s) -> a :: s)
▶ We got a list of digits, as opposed to a list of chars!
▶ Notice the type of the parser: char list -> int * char list
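The progression described in these slides (a packing function that conses, a list of digits, and finally a parser of type char list -> int * char list) can be sketched as below. The names nt_digit, nt_digits and nt_natural are assumptions, as is the exception name X_no_match; the original slide code is not reproduced in this text, so this is an illustration rather than the author's definition.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)
let nt_epsilon s = ([], s)
let delayed thunk s = thunk () s

(* One digit char, packed into its integer value. *)
let nt_digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')

(* One or more digits, collected with the packing function
   (function (a, s) -> a :: s). *)
let nt_digits =
  let rec make_nt_digits () =
    pack (caten nt_digit (disj (delayed make_nt_digits) nt_epsilon))
      (function (a, s) -> a :: s)
  in
  make_nt_digits ()

(* Fold the digit list into one int: char list -> int * char list. *)
let nt_natural =
  pack nt_digits (List.fold_left (fun acc d -> 10 * acc + d) 0)

let digits = nt_digits ['1'; '2'; '3'; 'x']   (* ([1; 2; 3], ['x']) *)
let number = nt_natural ['1'; '2'; '3'; 'x']  (* (123, ['x']) *)
```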
▶ By now, our toolset of parsing combinators consists of
▶ const ▶ caten ▶ disj ▶ pack ▶ delayed
▶ We can handle recursive grammars ▶ We can create ASTs ▶ In principle, we can implement parsers for any language
▶ For any NT P, P∗ stands for the
▶ The point of the Kleene-star is
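One possible way to write a Kleene-star combinator is the direct recursion below: match nt as many times as possible and collect the results in a list, returning [] when nt does not match at all. This is a sketch, not necessarily how the course package defines star; X_no_match is an assumed exception name and nt_a is a hypothetical example parser.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* star nt: zero or more repetitions of nt, as a list of results.
   When nt fails, we stop and return what we have, leaving s as is. *)
let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let nt_a = const (fun ch -> ch = 'a')

let some = star nt_a ['a'; 'a'; 'b']  (* (['a'; 'a'], ['b']) *)
let none = star nt_a ['b']            (* ([], ['b']) *)
```

A starred parser never fails, which is why the zero-repetition case yields the empty list.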
▶ For any NT P, P+ stands for the rule Pplus defined as follows:
▶ The point of the Kleene-plus is to recognize the catenation of
▶ Kleene didn’t really invent the Kleene-plus
▶ Rather, Kleene-plus is a natural extension of Kleene-star
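Seen as an extension of the Kleene-star, a plus combinator can be sketched as "one nt followed by star nt". The definitions below are assumptions made for illustration (including the exception name X_no_match and the example parser nt_a), not a quotation of the course package.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let pack nt f s = let (e, s) = nt s in (f e, s)
let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

(* plus nt: one or more repetitions, built as nt followed by star nt. *)
let plus nt =
  pack (caten nt (star nt)) (fun (e, es) -> e :: es)

let nt_a = const (fun ch -> ch = 'a')

let result = plus nt_a ['a'; 'a'; 'b']  (* (['a'; 'a'], ['b']) *)
```

Unlike star, plus fails when nt does not match even once.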
▶ Learn how to use abstraction to make our life simpler
▶ Take a string of chars, and convert it to a list ▶ Map over each character in the list, creating a parser that
▶ Perform a right fold over that list using the caten operation
▶ The unit element is the unit element of catenation, namely
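The recipe above can be sketched as follows. The helper string_to_list, the packing step that conses each matched char onto the rest, and the exception name X_no_match are assumptions made so the example is self-contained; word is the name the slides use for this parser.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let pack nt f s = let (e, s) = nt s in (f e, s)
let nt_epsilon s = ([], s)

(* Hypothetical helper: a string as a list of its chars. *)
let string_to_list str = List.init (String.length str) (String.get str)

(* word str: one single-char parser per character, folded from the
   right with caten, starting from the unit element nt_epsilon. *)
let word str =
  List.fold_right
    (fun nt1 nt2 -> pack (caten nt1 nt2) (fun (e, es) -> e :: es))
    (List.map (fun c -> const (fun ch -> ch = c)) (string_to_list str))
    nt_epsilon

let nt_let = word "let"
let result = nt_let ['l'; 'e'; 't'; ' ']  (* (['l'; 'e'; 't'], [' ']) *)
```

Starting the fold from nt_epsilon means the empty string yields a parser that matches nothing and returns [].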
▶ Very similar to word:
▶ We use disj rather than caten ▶ The unit element for disj is nt_none
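The same fold with disj in place of caten gives a parser that accepts any one character of a string. The name one_of is an assumption (the text here does not name the parser), as are string_to_list and the exception name X_no_match.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match
let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let nt_none _ = raise X_no_match

(* Hypothetical helper: a string as a list of its chars. *)
let string_to_list str = List.init (String.length str) (String.get str)

(* one_of str: a right fold with disj over single-char parsers,
   starting from nt_none, the unit element for disj. *)
let one_of str =
  List.fold_right disj
    (List.map (fun c -> const (fun ch -> ch = c)) (string_to_list str))
    nt_none

let nt_sign = one_of "+-"
let result = nt_sign ['-'; '5']  (* ('-', ['5']) *)
```

Since nt_none always fails, it is only reached when every alternative has failed, so it does not change the language recognized.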
▶ The PC trace_pc is a wrapper (using the decorator pattern)
▶ The trace_pc PC takes a documentation string and a parser,
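A tracing wrapper of this kind might be sketched as follows. The behavior shown (printing a line on success or failure, then passing the result or the failure through unchanged) and the message format are assumptions, as are the helper list_to_string and the exception name X_no_match.

```ocaml
exception X_no_match  (* assumed name for the failure signal *)

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* Hypothetical helper: a char list as a string, for printing. *)
let list_to_string s = String.concat "" (List.map (String.make 1) s)

(* trace_pc desc nt: behaves exactly like nt, but reports each attempt.
   The message format is an assumption made for this sketch. *)
let trace_pc desc nt s =
  try
    let ((_, rest) as result) = nt s in
    Printf.printf ";;; %s matched the head of \"%s\", remaining: \"%s\"\n"
      desc (list_to_string s) (list_to_string rest);
    result
  with X_no_match ->
    Printf.printf ";;; %s failed on \"%s\"\n" desc (list_to_string s);
    raise X_no_match

let nt_a = const (fun ch -> ch = 'a')
let traced_a = trace_pc "nt_a" nt_a

let result = traced_a ['a'; 'b']  (* ('a', ['b']), plus one trace line *)
```

Because the wrapper has the same type as the parser it wraps, it can be dropped into any combinator expression while debugging and removed afterwards.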