SLIDE 1

Where we're at: Syntax analysis of VSL

  • Things needed to:

  – Submit homework (pdfs and tarballs)
  – Build programs (make, cc)
  – Build scanners (Lex/flex)
  – Build parsers (Yacc/bison)
  – Build symbol tables (hashtables/libghthash)
  – Assemble machine code (as)

  • In PS3, we will see our Very Simple Language take shape, in terms of its
    syntactic structure (i.e., “which words go where”).

  • Things are getting a bit more complicated, but it's where the fun begins.
SLIDE 2

ps3_skeleton: The sequence of things

  • main calls yyparse() (which is generated by yacc from src/parser.y)
  • yyparse() creates a tree of node_t structs (TODOs in src/parser.y, tree.c), and
    assigns the root node to the globally declared 'root' if parsing succeeds

  • If compiled with the macro DUMP_TREES=1 or DUMP_TREES=3, then main prints a
    text representation of the tree at 'root' on stderr

  • main calls simplify_tree (TODO in tree.c) to prune away a few features from the
    syntax tree which are only convenient in syntax analysis (and would cause extra
    headaches when it comes time for code generation)

  • If compiled with the macro DUMP_TREES=2 or DUMP_TREES=3, then main prints a
    text representation of the (now simplified) tree on stderr

  • main takes down the tree with destroy_subtree (TODO in src/tree.c)
SLIDE 3

Yacc specifications in general

  • Much like Lex specifications, Yacc specs contain definitions, rules and
    function implementation sections, separated by %%

  • I will walk you through the definitions
  • The rules are not regular expressions any more; they are grammar productions in
    a format like

      left_nonterm : nonterm token nonterm nonterm { /* action 1 */ }
                   | nonterm token token { /* action 2, one NT and 2 tokens */ }
                   | { /* action 3 – epsilon productions don't need a right hand side */ }
                   | token { /* action 4 – just a token */ }
                   ;    ← semicolon ends a string of productions with the same l.h.s.

  • With the “return yytext[0];” final rule of the scanner, single characters act
    as their own tokens, so they can appear as literals (e.g. '+' or '}') in
    productions.

SLIDE 4

The declarations section

  • The 'extern' declarations of yytext and yylineno just mean that we rely on the
    linker to find them in the object code for the scanner.
  • 'node_t * root;' is a global declaration for a struct which will be assigned as
    the last thing the parser does – it's where main will get a hold of the
    parser's work.

  • Prototypes for yylex and yyerror just point at implementations elsewhere in the
    framework. yyerror is Yacc's callback for syntax errors; the implementation at
    the bottom just stops the program dead in its tracks (which will do for us).
  • The %token directives name all the tokens we want in the header file shared
    with the scanner. These are just magic integers.
  • %left sets the associativity of operators (there's a %right as well); breaking
    it across multiple lines orders operator precedence.

  • There's an operator UMINUS with no associativity and high precedence – this is
    a placeholder; its purpose is explained on the next slide.

SLIDE 5

Regarding UMINUS

  • '-' can act a bit funny as an operator; when it's part of a binary expression,
    it has precedence like '+', but when it's unary (like in “-123”) it binds
    tighter than everything.
  • To let Yacc work with this, we need to pretend that there's another token (it
    need not be returned from anywhere to have a precedence).

  • Operator precedence can be set by associating a grammar rule with a token's
    precedence, even if the token itself does not appear in the production: the
    rules

      expression : expression '-' expression { /* action goes here */ }
                 | '-' expression %prec UMINUS { /* action goes here */ }

    will handle the first rule according to the 'default' precedence of binary
    minus, and associate the precedence of the ethereal UMINUS with the second
    rule.

SLIDE 6

First steps: tree node structure

  • include/tree.h defines a structured type node_t, which will hold our tree
    nodes, and has elements

  – nodetype_t type (remember what kind of node this is)
  – void *data (to retain copies of lexical information where it's needed –
    integers, string literals and suchlike)
  – void *entry (never mind this for now, but we'll need it when the time comes
    for a bit of semantics)
  – uint32_t n_children (number of nodes this one links to below – unsigned, can't
    be less than 0)
  – node_t **children (list of pointers to the nodes below)

  • ...the following figure illustrates how these structs are supposed to link
    together

SLIDE 7

SLIDE 8

Shift/reduce parsing a la Yacc

  • The parser generated by yacc effectively traces out this tree for us,
    left-to-right, bottom-to-top, pushing tokens onto an internal stack and running
    a production's action every time it can reduce the right hand side of a
    production into the nonterminal on the left. At the bottom, with two
    productions:

  – integer: NUMBER { /* This code is called when the scanner returns a number */ }
  – expression: integer { /* This code is called next, since the right-hand side
    of the rule only requires that we've had an integer */ }
  – What happens next depends on what has been recently seen; if what's on the
    parser's internal stack was just missing an expression to complete the right
    hand side of a production, another rule will fire – otherwise, the scanner gets
    to fetch the next token, in the hope that something will match soon.

  • What we need to construct our tree is to build it inductively inside the
    productions' semantic action blocks (plain old C).

SLIDE 9

Where baby tokens come from

  • The skeleton for the parser really depends on a correct scanner.
  • Since some late submissions for that exercise must be admitted for a while
    still, I regrettably can't hand out the Lex spec. quite yet.
  • Instead, the skeleton code includes the C code for one generated by lex. (It's
    technically possible to reconstruct the reg. exps. from the state table
    therein, but I reckon it is more work than figuring them out from scratch...)
  • For the nonce, it scans VSL sources – the recipe will be included in future
    skeletons.

SLIDE 10

Referring to tokens and nonterminals in rules

  • As an example, consider the production rule

      integer : NUMBER { /* Code */ }

  • To construct a node_t from this, we need

  – The lexeme which was scanned for NUMBER (the parser knows about yytext, and
    can read it directly)
  – A dynamically allocated node_t to fill in the data
  – A way to assign it to the 'integer' nonterminal which is passed on up

  • $1 means “first token or nonterminal on the right hand side”
  • $2 is the second one, and so on...
  • $$ is the left hand side
SLIDE 11

Productions and type information

  • It's good that we can refer to the parts of a production, but yacc needs to
    know that $1 in this case is an integer (the NUMBER token value), and that we
    want $$ to be a (node_t *).
  • The %token directives in the declarations section say which names we want for
    the (more or less arbitrary) integer token values.
  • The definition of YYSTYPE at the very top says that we want all nonterminals to
    be of type (node_t *). It's possible to type them all explicitly, but we will
    have better use for a perfectly regular tree, so everything is a node
    structure.

  • Thus, $$ = (node_t *) malloc ( sizeof(node_t) ); in the integer rule above will
    pass upwards a pointer, and the code in the rule can fill it in with “0
    children”, NULL pointers where need be, and the integer value read from the
    lexeme (which can be found by “strtol(yytext, NULL, 10);”)
  • $$ is “some pointer to a node_t struct”, so “$$->type = integer_n;” etc. are
    perfectly valid statements.

SLIDE 12

Aside: State of the union

  • The type of nonterminals is where yacc likes unions.
  • For internal reasons, all nonterminals in the generated code are declared with
    YYSTYPE.
  • With the %union directive, you can define YYSTYPE to be a union of any number
    of types, e.g. %union { double dval; int ival; } will permit tokens to be
    typed, as in %token<dval> DOUBLE, and “$$.dval = (double)($1.ival) + 3.14;”
    will make sense for a production which needs to add an int and a double
    (without abandoning all kinds of type checking).
  • We won't be needing it right here, but thinking in terms of type-generic logic
    is a healthy mental exercise for any programmer.
  • The utility of yacc goes far beyond building compilers, so now you've heard
    this in case you ever need it.

SLIDE 13

Building blocks

  • Small as the language is, there are 48 productions in VSL.
  • If you need 8 lines of code per production, that is already 384 lines of parser
    with a lot of similar malloc-ing, not even counting whitespace and comments.
  • This is not hard, but it's more typing than is pleasant, and it's horrible to
    change if you need to.
  • Therefore, turning node genesis and destruction into one-liners is a positive
    boon.

  • That's what the auxiliary routines in src/tree.c are for.
SLIDE 14

Node genesis: (node_t *) node_init (...)

  • node_t * node_init ( node_t *nd, nodetype_t type, void *data, uint32_t n_children, … )
  • The idea here is to have a function for creating node_t structs in a jiffy;
    their contents are pretty regular.
  • However, there are variable numbers of children.
  • For this, we can apply the <stdarg.h> and va_list constructs, as discussed in
    recitation 1: this will permit writing things like

      return_statement : RETURN expression
          { $$ = node_init ( malloc(sizeof(node_t)), return_statement_n, NULL, 1, $2 ); }
          ;

    to make the l.h.s. point at a struct on the heap with NULL for the data
    pointer, and 1 child which points to the expression node from the r.h.s.

  • (The type already says that it's a return statement, so here we can toss the
    token)
SLIDE 15

Node destruction: node_finalize ( node_t *discard )

  • This one's pretty simple – just deallocate everything which is dynamically
    allocated inside.
  • If you just apply this at the top, it will leave the entire tree below dangling
    on the heap with no reference.
  • It's intended for use when editing the internals of the tree; taking down the
    whole thing will require a bit of recursion.

SLIDE 16

Tree destruction: destroy_subtree ( node_t *r )

  • A node is the only path to its children, so

  – Recursively destroy all the children
  – Then take out the node you are looking at, and return

  • This is for removing everything at the end, and it's a light start on
    recursively manipulating the tree.
  • node_print does a recursive traversal in the opposite order (handle this one
    before descent); it can work as an example.

SLIDE 17

The node types in 'nodetypes.c/h'

  • src/nodetypes.c is really nothing but a block of initialized and immutable
    structs: in short, it's a block of data which doesn't have to change.
  • Each nodetype_t struct consists of a (named) magic number, and a string which
    says the same thing.
  • This is just to find our way in the tree: the magic integer is good for
    switching/branching on, the string makes it easy to print the structure of the
    tree.
  • (This part is the place where a bit of object orientation could have simplified
    things a little; the type of a tree node could have been encoded in its class
    instead...)

  • Making this thing is more tedium than learning, so you get mine free.
SLIDE 18

Rules of thumb: NULL, and dynamic everything

  • Set all unused pointers to NULL – free(NULL) is defined as a no-op; this way
    everything can be freed using the same logic whether there is something there
    or not.
  • Allocate every little thing on the heap – even if it seems wasteful to malloc
    space for a 32-bit integer (at the data pointer of an integer node), this will
    again make the tree very regular, and thus a little more consistent to work
    with.

SLIDE 19

The weekly macro abuse

  • What's in yytext is very transient – blink and you'll miss it.
  • For (mostly) this reason, it's good to heap-allocate copies of strings needed
    later, to annotate the syntax tree with.
  • Strictly standard C doesn't include functions which heap-allocate internally.
  • Heap-copying strings is convenient enough that it's found in several
    compiler-specific extensions to the standard, however – you can often find a
    function 'strdup' which does it.
  • In the name of portability, we can roll our own:

      #define STRDUP(s) strncpy ( malloc ( strlen(s)+1 ), s, strlen(s)+1 )

    does the trick (assuming that s is 0-terminated), and is found in
    include/tree.h

  • char *mystring = STRDUP(yytext); // Get heap copy of lexeme buffer
  • Expressions come in many varieties, keep a copy of the operator in the node
SLIDE 20

Tip: start from the bottom!

  • Write up the utility routines in tree.c and test that you can build little
    trees with them correctly.
  • The present grammar in parser.y says that only whitespace and comments produce
    programs. Rather than typing up the entire grammar in one go, it's easier to
    start with producing integers, identifiers, etc., then moving on to
    expressions, statements, and the works...
  • Once you get the first few right and the principle is clear, the rest is mostly
    typing.

SLIDE 21

And once we have the tree...

  • The grammar is very conveniently written in BNF
  • Some of the implications for the tree structure are not as stunningly
    beautiful:
  • A variable_list contains a variable, and a variable_list, which contains a
    variable and a variable_list, which contains a variable and a variable_list,
    which contains a variable and so on...
  • ...all the way to where there's just 1 var.
  • Once the recursive def. has constructed our list, we know how many elements
    there are, say N.
  • This means that the entire subtree can be flattened to one variable_list node
    with N children, and become easier to manipulate later.

SLIDE 22

More artifacts

  • Mind how you handle the declaration_list; it's a little different from the
    others.
  • Some of the productions are only there to do such things as make a variable
    list optional (as in the parameter_list production).
  • The node types statement, argument_list and parameter_list are no longer
    valuable when syntax analysis is complete, they can be removed.
  • An expression like “(46+2)*5” already has a known value at compile time; we can
    rewrite it from a subtree to a single integer node.
  • (The 'null_statement' is a little inaptly named, but it has semantics, so we're
    keeping it...)

SLIDE 23

Simplifying the tree

  • This is another recursive traversal:

  – First, descend into what's below
  – When that has been simplified, handle the node where you are with a switch on
    the node type. Recognizing the structures which can be re-written is a matter
    of

  • examining the list of children
  • creating (and linking in) an equivalent node
  • removing the old one (which is where the non-recursive node_finalize comes in)

  • In general, it's easiest to work by depth-first traversal, from the bottom up:
    in this manner, only subtrees of 0, 1 or 2 nodes have to be recognized at a
    time.

SLIDE 24

Couldn't a lot of this be done in the parser?

  • Yes, it could.
  • We don't really have a speed problem on this scale – we can afford an extra
    tree traversal.
  • Manipulating the tree makes a good exercise in understanding its construction.
  • The parser is dense enough as it is.
  • Doing this separately, it is possible to complete a working parser without
    getting all the simplifications right.

SLIDE 25

Final words on testing

  • As before, doing things incrementally is smart – running the test case out of
    the box will just produce an error message. (We hijack stderr to dump trees to
    see what's going on; before it's working, 'vsl_programs/simplify.tree' just
    fills with the syntax error from the incomplete parser.)
  • 'make test' checks a tree dump against pre-generated ones
  • What I do to check my own solution:

      make purge                                  (start with a clean tree)
      export CFLAGS=-DDUMP_TREES=1 && make test   (expect an OK message that this
                                                   was the full tree from
                                                   'simplify.vsl' under vsl_programs)
      make purge                                  (need to build again to dump the
                                                   other tree)
      export CFLAGS=-DDUMP_TREES=2 && make test   (expect an OK message that this
                                                   was the simplified tree from
                                                   that file)

  • (May not be a shining example of automation, but it should work)