Making the interpreter

It was quite fun to write a BASIC interpreter in C++11 and BASIC itself is a simple language which recalls memories from the early days of personal computing, where each computer - such as the glorious Commodore 64 - had one of those embedded inside.

Main interpreter components 

To write nuBASIC interpreter I followed the approach which mainly consists in parsing the BASIC source into a parse tree and then execute it, so I wrote the following main components, however they often do not have one to one correspondence with a single class or other C++ structure:
  • Tokenizer: breaks language string into tokens 
  • Expression Parser: transforms mathematical/logical/string expression into an internal evaluable tree object 
  • Expression Evaluator: evaluates expression tree object 
  • Statement Parser: creates an executable syntax tree 
  • Statement Executor: executes a BASIC statement (which can or cannot contain one or more BASIC expressions). 
  • Static program context: collects and handles program meta-data (needed to handle and execute multiple-lines BASIC constructs). 
  • Run-time program context: handles objects such as variable state, procedure stack, function return value and so on, dynamically generated during the program execution. 
  • Command Line Interface (CLI): a console oriented interface which is the main user interface of the interpreter. It has line-oriented editor which allows programmer to insert or modify program on-the-fly. 
  • Interpreter: executes CLI commands, runs programs handling program data and meta-data, including debugging stuff. 
  • Built-in function library: implements predefined functions used within expressions and statements. 
  • Language statement objects: the implementation of language statement such as control structures (e.g. If-Then-Else, For-Next, Do-Loop-While, etc.), I/O built-in procedures (Print, Input, Open, etc.) and so on. 
  • Syntax and run-time error handlers: helper classes used to handle C++ exceptions which implement syntax and run-time language errors 
  • Built-in help: an interactive guide containing information about interpreter commands and language statements that a user can query via specific CLI commands.


A line-oriented language interpreter

BASIC language is line-oriented and this produces one of the key differences between BASIC and other programming languages where the position of hard line breaks in the source code is irrelevant.

Each code line in a BASIC program forms a self-contained unit.

For this reason nuBASIC interpreter is itself line-oriented: program source text is split into lines which are owned from Interpreter class (source lines are stored in a map of pairs <line-number, text-line>).

Each code line is parsed into self-contained execution unit.

The interpreter builds a static program context which represents the glue code among program lines (and statements).

Indeed, the Statement Parser recognizes complex language constructs although are split in different lines, and builds meta-data which refers them. Each control structures line can also contain more than one statement.

Handling the tokens

The parsing of each line is preceded by the separate lexical analysis provided by the Tokenizer, which creates tokens from the source text.

A token is implemented just as a class which contains original source text, token position within the text, length, type (token type can be one of the following: 'identifier', 'operator', 'literal string', 'integer', etc...) and other data.

For token representation I did not use a pure OOP approach, which generally classifies token types building a class hierarchy. I decided to use a flat representation of token type which is just an “enum-class” attribute of the token object. The rational is that homogeneous token objects are easier to collect and handle.

Parser uses the token type attribute to recognize and validate the language syntax and build statement objects and the meta-data that interpreter needs at run-time in order to execute a program.


Token list container

To reduce parser complexity, a Token List container class, wrapped around a standard Deque, has been provided.

Token list class adds some facility thought to make simple handling token lists and reduce parser implementation complexity.


Parsing the code


While an unique Tokenizer exists, more than one Parser has been implemented: 

  • Expression Parser analyzes an expression (such as “2+2*3/Sin(PI()/2)”) and produces an executable object instance. 
  • Statement Parser analyzes each source code line generating an executable statement object. 
    To complete this job the Statement Parser invokes the Expression Parser for each expression encountered in the source text in order to obtain its executable equivalent representation 
  • CLI Parser: the simplest of three parsers, is part of the Interpreter and it is responsible to validate and execute commands such as List, Run, Load, Save, and so on, which are not statements of the language but just interpreter environment commands.


One of main difficulties in parsing our language comes from expressions.

For instance, having parsed the beginning of an expression such as the following “2+4*17”, the resulting syntax tree for that depends on the whole expression, while knowing only the beginning of it, “2+4”, is not enough to build correctly the syntax tree.

Expression Parser has to recognize operators precedence and re-arrange the previous expression in something like this “(2 + (4 * 17))” in order to build a well-formed syntax tree, as shown below:

                        '+'

                        / \
                      '2' '*'
                          / \
                        '4' '17'

Expression parsing is done in different steps: 

  • Tokenizer breaks an expression (a string) down in tokens held from a token list object instance.
  • The token list is rearranged (special 'begin' and 'end' sub-expression markers are inserted into the list) to obtain the following operator precedence order:
    • Unary identity and negation (+, –). 
    • Exponentiation (^), Multiplication and floating-point division (*, /), Integer division (\, Div), Modulus arithmetic (Mod). 
    • Addition and subtraction (+, –). 
    • Comparison operators (=, <>, <, <=, >, >=). 
    • Logical Conjunction (And), Inclusive disjunction (Or), Exclusive disjunction (Xor), bit-wise operators 
  • Expression Parser builds an evaluable syntax tree where each node is an instance of a class belonging to the hierarchy formed by expr_any_t derived classes which represents: empty expressions, binary expressions, functions, variables and literal constants.


For example, considering following mathematical expression “2+4*17”, it becomes something like the following objects tree:

                            binary_expression
                                 (sum)
                                   / \
                             literal  \ 
                            (
integer)  \
                                2    binary_expression
                                      (multiplication)
                                            / \
                                           /   \
                                     literal   literal
                                     (integer) (integer)
                                          4     17



The abstract class expr_any_t defines the virtual method eval(), which sub-classes implement.

The prototype of eval() method is the following:
  • virtual variant_t eval(rt_prog_ctx_t & ctx) const = 0;

The method accepts a reference to run-time program (execution) context and returns a variant_t object, that we shall discuss next section.


Variant

Variant class is provided to manipulate several distinct BASIC language types in a uniform manner.
This reduces drastically the evaluator complexity.
I did not use C++ union because it supports only plain old data types and it is not adapt for non-trivial complex types.


Tracing execution of a simple program

YouTube Video


Let us consider the following simple BASIC program, containing just a unique line with a unique statement:

10 PRINT 2+4*17


Suppose you have already inserted the program so you have just type “RUN” to execute it.

First nuBASIC_console() function gets the command string “RUN” from standard input, then invokes the function exec_command() which is a helper function that invokes the related interpreter exec_command() method, catching any exceptions.

This method parses a command in order to recognize it and perform the action required.

In this case it calls the rebuild() interpreter method which clears static and run-time context objects (removing both dynamic data and meta-data), then for each source line calls the update_program() method. This method creates a Tokenizer object and calls the compile_line() method of the Statement Parser object, held as attribute of the Interpreter object.

compile_line() method uses the Tokenizer to break down the source line into a language token list like the following (for simplicity no all token fields are reported below):

{ (“PRINT”, identifier), (“ “, blank), (“2”, integer), (“+”, operator), (“4”, integer), (“17”, integer) }


Each line of code is treated as a “block” which is a container of statements. Thus the method parse_block() is first called. This method iterates while the token list is not empty calling for each iteration the method parse_stmt(), which is able to recognize the statement in order to select the specific parse_xxx() method().

In our example, it recognizes the token “PRINT” (which is the first token of the token list), and calls the specific parse_print() method and this builds a stmt_print_t object which holds an expression object instance. parse_print() method calls the template function parse_arg_list() which builds an expression list for the PRINT statement. Each item of the list is an expression object built via the Expression Parser, ready to be evaluate by its eval() method against a run-time program (execution) context.

Finally, parse_print() method returns to the calling parse_block(). Which in turn returns an handle to statement block object by means of a (smart standard) shared pointer to the class object.

The statement handle is stored in a map (prog_line_t) where the key element is just the processing line number.

After building the program, interpreter calls the run() method which creates a program_t object instance passing to it program line and program context objects coming from previous building phase.

The program object executes each code line (represented by block statement object as discussed before) by calling the related run() virtual method. In our example the unique program line (a block statement object) contains the print statement object. Calling its run() method the related print statement run() method is finally invoked.


stmt_print_t run() method evaluates each argument of its argument list. The argument list is just a collection of expression objects which export the eval() method. The eval() method returns a variant object that can be printed out to the standard output.

Comments