Implementation Stages
- Tokenization
String -> [Token]
- Makes defining and recognizing correct sequences easier
- Sometimes called lexing
- Parsing
[Token] -> ParseTree
- A tree that represents a successful parsing of a sequence of tokens
- (optional) AST Conversion
ParseTree -> AST
- (optional) Static Checks
AST -> AST
- See also: program analysis
- Evaluate
AST -> Result
- Run the input or generate code for it
- (optional) Dynamic Checks
See: compiler
Grammar Rules
e.g. for BNF, EBNF
- Generally matches left to right
- Single-quoted strings are literal
- Grammar rules end with semicolons
ANTLR Lexer
ANTLR Parser
- Parser grammar rules have lower-case non-terminal symbols
- Parser rule bodies can use both parser non-terminals and lexer ones
- Though, we should avoid doing this and keep parser and lexer rules separate
- Parser rule bodies may not include regex character classes (e.g.
[0-9]
or\d
)
3 Parsing Guidelines
- The grammar cannot be ambiguous: any given input string has at most one parse tree that accepts it
- No left recursion: each rule cannot start with itself (even indirect)
- Should not allow
T ::= T ...
- Should not allow
- Grammar must be locally deterministic: for each choice, we must be able to choose between them based only on the next token (avoid common prefixes, factor them out into separate rules)
Language Principles
- Learnability (how quickly can you pick it up; feels like “common sense”?)
- Efficiency (once you’ve learned it, how efficiently can you perform tasks)
- Memorability (coming back to the language, how easy to regain proficiency)
- Errors (how many do users make, how severe, how easily can they recover)
- Satisfaction (subjective, but very important for perseverance and adoption)
See also: software principles, design goals, programming models
- Maximize information hiding
- Make classes, members as private as possible
- Public classes should have no public fields (with the exception of constants)
- Don’t confuse users
- Keep things simple
- Name things well
- Keep things consistent
- Have good documentation
- Avoid unnecessary boilerplate
- Make it boring (intuitive, expected)
What is the purpose of a language?
- We think in a particular language and it determines how you think (see: Sapir-Whorf)
- Languages should help us think better
See also: notation
Evaluation
Recursive Evaluation
Each node has an evaluate method. Recursively traverse the tree and evaluate each node.
But this only supports a single type of traversing the AST. What if we want to support other types of checkers? We have many different operations that traverse the AST
Putting all functionality into AST methods violates SRP. We can instead, implement the visitor pattern.
Visitor Pattern
The visitor design pattern is a way of separating an algorithm from an object structure on which it operates.
Basically, you are passing this visitor object to a node’s accept function.
- Visitor defines a
visit
for each concrete node type to detail how to visit that node + its children (functionality depends on visitor) - If it needs to visit another node, it calls
accept
on itself (functionality also depends on the node type)
We perform double dispatch as the functionality depends on two things:
- the type of AST object (via the
accept
call) and - the type of visitor object (via the
visit
call)
We could just evaluate each AST node, but this places the responsibility on the nodes for how to do this.
- Support multiple kinds of “evaluation” for our AST without having to edit every node every node every time
- Evaluation is in a separate file from the AST implementation
- Create a
Visitor
interface under AST and define a bunch of visit methods for each concrete node type - Under the abstract
Node
class, create an abstractaccept
method - Create a new visitor class that implements the
Visitor
interface