Stork, An Example Programming Language, Lesson 1: Tokenization

Welcome back!

For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 1 in the series, so you haven’t missed much!

The code for this lesson is available on GitHub under the tag lesson1, and you can follow the discussion about this lesson on reddit.

What is Tokenization?

If a programming language is a (more) convenient language humans can use to describe tasks and processes to computers, then tokenization is the process of turning a program’s raw text into words, or “tokens,” that the computer can understand. For example, consider this simple Python program for the factorial function:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

For this program text, the tokens would be: def, factorial, (, n, ), :, if, n, ==, and so on. Looking at tokens — atomic units of program semantics — as opposed to characters makes the next lesson’s topic of “parsing,” or discovering the semantic relationships among the different parts of the program text, much easier.

In a very real sense, the tokenizer defines the vocabulary of the programming language.

How does Tokenization Work?

Just like there are different kinds of words — nouns, verbs, adjectives, etc. — there are different kinds of tokens, like numbers, strings, and mathematical operators. Tokenizers are programs that convert the characters of program text into tokens.

Tokenizers are very simple programs. Essentially, a tokenizer skips stuff that’s not a token, then reads stuff that is a token until it can’t anymore, and repeats that over and over again until it runs out of stuff.

In more detail, a tokenizer “sees” three different kinds of characters: (1) characters that cannot be part of any token, like whitespace; (2) characters that can be the initial character of a token; and (3) characters that can be non-initial characters of a token. A tokenizer reads program text character by character. As it reads, it skips non-token characters, then inspects the first non-skippable character it finds. If this character is a legal initial token character, the tokenizer proceeds to read non-initial token characters until it has read the longest possible token, based on that initial character; otherwise, it raises an error indicating that it has found an illegal character. This approach works because the type of token to be read can be determined by inspecting the first token character the tokenizer sees: a letter for a symbol, a digit for a number, and so on¹. This process is illustrated as an FSM (finite-state machine) below.

High-Level Tokenization FSM
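
In Java, the core of this skip/read loop might look like the following self-contained sketch. To be clear, this is an illustration of the FSM above, not Stork’s actual code: the LoopSketch class is hypothetical, and it recognizes only “word” tokens made of letters.

import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the tokenization loop described above. It only
// recognizes "word" tokens; real tokenizers dispatch on the initial
// character to handle numbers, strings, operators, and so on.
public class LoopSketch {
    public static List<String> tokenize(String text) {
        List<String> result=new ArrayList<String>();
        int i=0;
        while(i < text.length()) {
            // (1) Skip characters that cannot be part of any token
            while(i < text.length() && Character.isWhitespace(text.charAt(i)))
                i=i+1;
            if(i == text.length())
                break;
            // (2) Inspect the first non-skippable character; a letter
            // means a word token starts here
            if(Character.isLetter(text.charAt(i))) {
                int start=i;
                // (3) Consume non-initial token characters greedily, so
                // the longest possible token is always read
                while(i < text.length() && Character.isLetter(text.charAt(i)))
                    i=i+1;
                result.add(text.substring(start, i));
            }
            else
                throw new IllegalArgumentException("Illegal character: "+text.charAt(i));
        }
        return result;
    }
}

For example, tokenize("hello world") returns the list [hello, world], while tokenize("hello!") raises an error at the "!".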

Implementing Stork’s Tokenizer

Creating Stork’s tokenization engine requires two primary abstractions: a Tokenizer to do the character-to-token conversion, and the Token object it emits. These abstractions are implemented in three new classes:

src/main/java/com/sigpwned/stork/io/ParseReader.java — Reader with lookahead and other features
src/main/java/com/sigpwned/stork/parse/Token.java — Token object, including token type descriptions
src/main/java/com/sigpwned/stork/parse/Tokenizer.java — Converts program text to tokens

Implementing Tokens: Token.java

Tokens are implemented using a simple model class, which includes a couple of enumerations that represent the various flavors of tokens.

At a high level, Stork recognizes four meta types of tokens: (1) values, (2) keywords, (3) operators, and (4) “special” tokens, like EOF:

public static enum MetaType {
    VALUE, KEYWORD, OPERATOR, SPECIAL;
}

These meta types encompass a much greater number of actual token types:

public static enum Type {
    // Value Tokens
    INT(MetaType.VALUE, null),
    FLOAT(MetaType.VALUE, null),
    STRING(MetaType.VALUE, null),
    SYMBOL(MetaType.VALUE, null),

    // Keyword Tokens                                                       
    IF(MetaType.KEYWORD, "if"),
    WHILE(MetaType.KEYWORD, "while"),
    TRUE(MetaType.KEYWORD, "true"),
    FALSE(MetaType.KEYWORD, "false"),
    NULL(MetaType.KEYWORD, "null"),

    // Operator Tokens
    PLUS(MetaType.OPERATOR, "+"),
    MINUS(MetaType.OPERATOR, "-"),
    STAR(MetaType.OPERATOR, "*"),
    SLASH(MetaType.OPERATOR, "/"),

    // Special Tokens
    EOF(MetaType.SPECIAL, "$");

    private MetaType type;
    private String text;

    private Type(MetaType type, String text) {
        this.type = type;
        this.text = text;
    }

    public MetaType getType() {
        return type;
    }

    public String getText() {
        return text;
    }
}

A proper token is an aggregate of one of these token types and token text:

public class Token {
    public static enum MetaType {
        // See above...
    }

    public static enum Type {
        // See above...
    }

    private Type type;
    private String text;

    public Token(Type type, String text) {
        this.type = type;
        this.text = text;
    }

    public Type getType() {
        return type;
    }

    public String getText() {
        return text;
    }
}
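
As a quick usage example (mine, not from the Stork source), the tokens for the input n * 2 would be constructed as:

Token n    = new Token(Token.Type.SYMBOL, "n");
Token star = new Token(Token.Type.STAR, Token.Type.STAR.getText()); // "*"
Token two  = new Token(Token.Type.INT, "2");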

Implementing Tokenization: Tokenizer.java

Stork’s tokenizer class uses a fairly direct implementation of the simple algorithm outlined above. The different token types are handled differently, but consider an alphabetical token as an example. The token type’s three character classes are defined by:

  1. Non-token characters: Character#isWhitespace
  2. Initial token characters: Character#isLetter
  3. Non-initial token characters: Character#isLetter

…and the tokenization code itself looks like:

if(Character.isLetter(getInput().peek())) {
    // Consume the longest possible run of letter characters
    StringBuilder buf=new StringBuilder();
    while(Character.isLetter(getInput().peek()))
        buf.append((char) getInput().read());

    // Keywords and variable names look identical until the whole token
    // has been read, so check the finished text against the keyword table
    String text=buf.toString();
    if(keywordTypes.containsKey(text)) {
        Token.Type type=keywordTypes.get(text);
        result = new Token(type, type.getText());
    }
    else
        result = new Token(Token.Type.SYMBOL, text);
}

The while loop is a direct implementation of the non-initial token character consume loop, and collects letter characters into a buffer. The check against keywordTypes then determines if the given token is a reserved keyword (like if or true) or simply a variable name, and the appropriate token is then returned.
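
The lesson doesn’t show how keywordTypes is populated. One plausible construction (an assumption on my part, not necessarily how the Stork source does it) derives the map from the KEYWORD entries of the Type enum, assuming the usual java.util.Map, HashMap, and Collections imports:

// Hypothetical construction of the keyword lookup table: collect every
// Token.Type whose meta type is KEYWORD, keyed by its literal text
private static final Map<String,Token.Type> keywordTypes;
static {
    Map<String,Token.Type> keywords=new HashMap<String,Token.Type>();
    for(Token.Type type : Token.Type.values())
        if(type.getType() == Token.MetaType.KEYWORD)
            keywords.put(type.getText(), type);
    keywordTypes=Collections.unmodifiableMap(keywords);
}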

The other token types are processed using slight variations on this simple theme — read in all relevant characters, then build a token — although their character-class recognition code is somewhat more complex. Together, the cases for the variable/keyword token type and the other token types define the primary method of the Tokenizer class: nextToken. (A sketch of one of the elided cases, the numeric branch, appears after nextToken below.)

public Token nextToken() throws IOException, ParseException {
    Token result;

    w();    // skip non-token characters (whitespace)

    if(getInput().peek() == -1)
        result = new Token(Token.Type.EOF, Token.Type.EOF.getText());
    else
    if(Character.isLetter(getInput().peek())) {
        // See above
    }
    // Other cases (numbers, strings, operators) are handled by
    // additional else-if branches here...
    else
        throw new ParseException("Unrecognized character: "+(char) getInput().peek(), getInput().getOffset());

    return result;
}
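
As an example of one of the elided “other cases,” the numeric branch might look like the following sketch. This is my illustration of the pattern, not Stork’s actual source: an initial digit selects the number case, and a decimal point encountered after the integer part upgrades the token from INT to FLOAT. It slots into the else-if chain of nextToken above.

else
if(Character.isDigit(getInput().peek())) {
    // Consume the integer part
    StringBuilder buf=new StringBuilder();
    while(Character.isDigit(getInput().peek()))
        buf.append((char) getInput().read());

    // A decimal point (followed by more digits) upgrades INT to FLOAT
    if(getInput().peek() == '.') {
        buf.append((char) getInput().read());
        while(Character.isDigit(getInput().peek()))
            buf.append((char) getInput().read());
        result = new Token(Token.Type.FLOAT, buf.toString());
    }
    else
        result = new Token(Token.Type.INT, buf.toString());
}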

Implementing the testing harness: Stork.java

The core tokenization classes are driven by an additional test harness class, Stork.java. This class exposes a simple REPL that reads lines of user input and emits the tokens contained in each line.
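
A minimal version of such a harness might look like the sketch below. This is an approximation, not the actual Stork.java; in particular, it assumes a Tokenizer constructor that accepts a Reader.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringReader;

public class Stork {
    public static void main(String[] args) throws Exception {
        BufferedReader in=new BufferedReader(new InputStreamReader(System.in));
        System.out.print(">>> ");
        for(String line=in.readLine(); line!=null; line=in.readLine()) {
            // Tokenize one line at a time, printing each token's type
            // and text until the tokenizer reports EOF
            Tokenizer tokenizer=new Tokenizer(new StringReader(line));
            for(Token token=tokenizer.nextToken();
                    token.getType() != Token.Type.EOF;
                    token=tokenizer.nextToken())
                System.out.println("    "+token.getType()+"\t"+token.getText());
            System.out.print(">>> ");
        }
    }
}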

Downloading and Running Stork

Stork is designed to be easy to download, build, and run. As long as you have Git, Maven, and a JDK installed, you can get Stork up and running on your computer like this:

$ git clone https://github.com/sigpwned/stork.git
$ cd stork
$ git checkout lesson1
$ mvn compile
$ cd target/classes
$ java com.sigpwned.stork.Stork
>>> hello world if 5.0 6 17 "hello"
    SYMBOL  hello
    SYMBOL  world
    IF      if
    FLOAT   5.0
    INT     6
    INT     17
    STRING  hello
>>> ^D
$

Next Lesson: Parsing

The next step in building Stork will be to transform the tokenizer’s token stream into syntactic models called ASTs (abstract syntax trees) using a process called “parsing.” The next lesson will cover how to build a recursive descent parser by hand for simple mathematical expressions, and this parser will be extended throughout the course as Stork gains more and more features.

Exercises

  1. Right now, Stork only allows letters in its keywords and variable names. Would it be a good idea to let users start variable names with an underscore? How about a number? Why or why not?
  2. Modify the program to allow users to include underscores in variable names.
  3. Modify the program to recognize the new keyword end.

Footnotes

1. In practice, it is not necessary to make token types fully resolvable from only the token’s first character. Some tokenization-time ambiguity can be resolved by inspecting preceding tokens or by passing the tokenizer “hints” about what token meta type the parser is expecting next. However, as a design principle, it’s best to avoid this level of sophistication during tokenization without a Very Good Reason.