For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 4 in a series of 10, so if you’re just joining now, you may want to take a peek at lessons 1, 2, and 3 to gear up a bit for this post.
Lesson 4 adds variables to Stork, which involves adding statements in addition to the expressions already in the language. The addition of variables provides fodder for some additional (and more interesting) static analysis as well. At the end of this lesson, Stork will be a working interpreter for simple numerical expressions with support for variables. (The variables will become much more interesting over the course of the next couple of lessons, which will add support for functions and control structures.)
$ java com.sigpwned.stork.Stork >>> var x:Int >>> x ERROR: Variable may not have been initialized: x >>> x = 1+2*(3+4) 15 >>> var y:Float >>> y = x 15.0 >>> x = y ERROR: Will not coerce to less precise type: Float -> Int >>> x = (cast Int) y 15 >>> x+y 30.0 >>> ^D $
For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 2 in a series of 10, so if you’re just joining now, you may want to check out lesson 1 to gear up a bit for this story.
Lesson 2 covers the basics of parsing numerical expressions using the tokenizer implemented in Lesson 1. Evaluation of these expressions will be handled in Lesson 3.
What is Parsing?
If a programming language is a (more) convenient language humans can use to describe tasks to computers, then parsing is the process of turning a program’s tokens into sentences, or “abstract syntax trees” (ASTs), that the computer can understand. For example, consider this simple mathematical expression for the area of a circle with radius 5:
For this program text, the tokens would be
5, and a parser would build the following AST for it:
Clearly, parsing is essentially “sentence diagramming” for a programming language.
This lesson covers how to transform a token stream into a parse tree like the above example. Looking at parse trees — the syntactic relationships among tokens — instead of the tokens themselves will make evaluating those expressions much easier in the next lesson.
For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 1 in the series, so if you’re just joining now, you haven’t missed much!
What is Tokenization?
If a programming language is a (more) convenient language humans can use to describe tasks and processes to computers, then tokenization is the process of turning a program’s raw program text into words, or “tokens,” that the computer can understand. For example, consider this simple Python program for the factorial function:
def factorial(n): if n == 0: return 1 else: return n * factorial(n-1)
For this program text, the tokens would be:
==, and so on. Looking at tokens — atomic units of program semantics — as opposed to characters makes the next lesson’s topic of “parsing,” or discovering the semantic relationships among the different parts of the program text, much easier.
In a very real sense, the tokenizer defines the vocabulary of the programming language.
There are several questions every developer gets asked sooner or later:
- How do you learn to program?
- Should I learn to program?
- What is a program?
- Where do programs come from?
- What is a programming language?
- Where do programming languages come from?
Most of these questions already have excellent answers online. But while the question of where programming languages come from is the topic of some excellent books, I’ve never been able to find a simple, straightforward, free answer online.
I’m happy/sad this Michael Scott is not the author of that book.
Stork is a simple, free programming language I’m writing in ten steps and documenting in writing, code, and Reddit threads. I think Stork fills that hole, and I hope that you, gentle reader, think that it does, too.
This is a simple spectrum of how fast some various popular programming languages fail. For a better discussion, have a look at my blog post on the subject.
Writing software is hard. If the guy next to you hasn’t created ten new bugs, management hasn’t added 10 new features, the customer hasn’t tightened the due date by a week, and your technical lead hasn’t passed a new programming commandment (“Thou shalt not use identifiers with fewer than 4 characters!”) – all by noon – then it’s a good day in Developerstan.
Writing software is hard enough just dealing with the practical, every-day, mundane bits without even having to think about the difficulty added by the actual technical work itself. To write really good, clean, correct code, the programmer has to have a million tiny things in his head – “What does
Integer.parseInt() throw if it fails?” “Which directory is
ServletContext.getResourceAsStream() relative to?” “Which language is this product in again?” – before he can even write a single line of code. And the really good programmer has to check all his assumptions and return codes before he can even get down to business.
Wouldn’t it be nice if your tools yelled at you when you miss something?