How To Write A Calculator in 70 Python Lines, By Writing a Recursive-Descent Parser

Three months ago, I wrote a post detailing the process of writing a calculator using a parsing library. The popular response, however, was that readers are far more curious about seeing a calculator written from scratch, with the batteries included but nothing else. I figured, why not?

Writing a calculator is simple, if you use hacks specific to arithmetic expressions, but the effect of hacks is nearly always the same: the solution isn't elegant, it's not extensible, and it's hard to understand intuitively. Since I appreciate a good challenge and wanted this post to be useful, I decided to write it using a mostly generic recursive-descent parser. In the same spirit as last time, I wanted to do it in as few lines as I reasonably could, so it's filled with hacks and tricks, but they're superficial and not specific to the task at hand.

This post is a detailed, step-by-step explanation of my implementation. If you want to jump straight to the code and figure it out by yourself, just scroll to the end of this post. Hopefully when you're done you'll have a better understanding of how parsing works internally, and you'll be inspired to use a proper parsing library to avoid this entire bloody mess.

To understand this post, you should have a strong understanding of Python, and it's recommended to have some understanding of what parsing is and what it's for. If you're not sure, I recommend that you read my previous post, in which I thoroughly explain the grammar that I will be using in this post.

Step 1: Tokenize

The first step of processing the expression is to turn it into a list of individual symbols. This is the easiest part, and not the point of this exercise, so I allowed myself to cheat here quite a lot.

First, I defined the tokens (numbers are notably absent; they're the default) and a Token type, along with the two imports this step needs:

import re
from collections import namedtuple

token_map = {'+':'ADD', '-':'ADD',
             '*':'MUL', '/':'MUL',
             '(':'LPAR', ')':'RPAR'}

Token = namedtuple('Token', ['name', 'value'])

And here's the code I used to tokenize an expression `expr`:

split_expr = re.findall(r'[\d.]+|[%s]' % ''.join(token_map), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]

The first line is a trick that splits the expression into the basic tokens, so

'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']

The next line names the tokens, so that the parser can recognize them by category:

['1.2', '/', '(', '11', '+', '3', ')']
-->
[Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

Any token that is not in the token_map is assumed to be a number. Our tokenizer doesn't validate its input, so it will happily accept tokens that aren't really numbers (such as '1.2.3'); luckily, the evaluator will catch those later, when float() fails on them.
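
If you'd rather the tokenizer complain right away, here's a minimal validation sketch (my own addition, with my own naming; it's not part of the 70 lines). The idea: the tokens, joined back together, should reproduce the input minus whitespace; anything else means re.findall() silently skipped a character it didn't recognize.

    def tokenize(expr):
        # Uses the token_map and Token defined above.
        split_expr = re.findall(r'[\d.]+|[%s]' % ''.join(token_map), expr)
        # Whatever re.findall() skipped is, by definition, a character
        # we don't recognize -- reject the whole expression.
        if ''.join(split_expr) != re.sub(r'\s+', '', expr):
            raise ValueError('invalid characters in expression: %r' % expr)
        return [Token(token_map.get(x, 'NUM'), x) for x in split_expr]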

That's it. Now that we have a list of tokens, our next step is to parse it into an AST.

Step 2: Define the grammar

The parser I chose to implement is a naive recursive descent parser, which is a simpler version of LL parsing. It's the simplest parser to implement, and in fact mine takes only 14 lines. It's a kind of top-down parser, which means that it starts by matching the highest rule (like: expression), and recursively tries to match its sub-rules until it matches the lowest rules (like: number). To put it another way, while a bottom-up (LR) parser will gradually fold tokens and rules into other rules, until there's only one rule left, a top-down (LL) parser like ours will gradually expand the rules into less abstract rules, until they completely match the input-tokens.

Before we get to the actual parser, let's talk about the grammar. In my previous post, I used an LR parser, and I defined the calculator grammar like this (caps are tokens):

add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;

(If you don't understand this grammar, you should read my previous post)

This time I'm using an LL parser, instead of LR, and here's how I defined the grammar:

rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}

There is a subtle change here. The recursive definitions of add and mul are reversed. This is a very important detail, and I need to explain it.

The LR version of this grammar uses something called left-recursion. When LL parsers see recursion, they just dive in there in an attempt to match the rule. So when faced with left-recursion, they enter infinite recursion. Even smart LL parsers such as ANTLR suffer from this issue, though they will probably emit a friendly error instead of looping infinitely like our toy parser would.
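
To see why, trace what a naive top-down matcher does with the left-recursive rule:

    # add: add ADD mul | mul
    match('add')
     -> try expansion 'add ADD mul'
        -> match('add')
           -> try expansion 'add ADD mul'
              -> match('add')
                 -> ...   # recurses before consuming a single token,
                          # so it never terminates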

Left-recursion is easily solved by changing it to right-recursion, and that is what I did. But because nothing is easy with parsers, doing so created another problem: while left-recursion parses 3-2-1 correctly as (3-2)-1, right-recursion parses it incorrectly as 3-(2-1). That's the difference between the correct answer, 0, and the incorrect 2. I don't know of an easy solution to this problem, so to keep things short and simple for you and me both, I decided to keep the incorrect form and deal with it in post-processing (see step 4).

Step 3: Parse into an AST

The algorithm is simple. We're going to define a recursive function that receives two parameters: the first is the name of the rule that we're trying to match, and the second is the list of tokens we have left. We'll start with add (which is the highest rule) and with the entire list of tokens, and have the recursive calls become increasingly more specific. The function returns a tuple: the current match, and a list of the tokens that are left to match. To keep the code short, we'll make it capable of matching tokens as well as rules (both are strings; token names are UPPER-CASE and rule names are lower-case).

Here's the code for the parser:

RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found

Lines 4-5 check if rule_name is actually a token, and if it matches the current token. If it does, it will return the match, and which tokens are still left to consume.

Line 6 iterates over the sub-rules of rule_name, so each can be matched recursively. If rule_name is a token rather than a rule, the get() call returns an empty tuple, and the flow falls through to the failure return (line 16).

Lines 9-15 iterate over every element of the current sub-rule and try to match them sequentially, each match consuming tokens from the list. If one element fails to match, we discard the entire sub-rule and move on to the next expansion. However, if all elements matched, we reach the else clause and return our match for rule_name, along with the remaining tokens.

Let's run it and see what we get for 1.2 / ( 11+3).

>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

>>> match('add', tokens)

(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])

The result is a tuple, of course, and we can see there are no remaining tokens. The actual match is not easy to read, so let me draw it for you:

    add
        mul
            atom
                NUM '1.2'
            MUL '/'
            mul
                atom
                    LPAR    '('
                    add
                        mul
                            atom
                                NUM '11'
                        ADD '+'
                        add
                            mul
                                atom
                                    NUM '3'
                    RPAR    ')'

This is what the AST looks like, in concept. It's good practice to trace the parser's run in your mind, or on a piece of paper. I dare say it's necessary if you want to grok it. You can use this AST as a reference to make sure you got it right.
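
By the way, that drawing isn't magic. Here's a minimal sketch of a pretty-printer that produces something very close to it (my own helper, with my own naming; it's not counted in the 70 lines):

    def print_tree(node, indent=0):
        # RuleMatch nodes carry a rule name and a list of children;
        # everything else is a Token leaf with a name and a value.
        if node.name in rule_map:
            print(' ' * indent + node.name)
            for child in node.matched:
                print_tree(child, indent + 4)
        else:
            print(' ' * indent + '%s %r' % (node.name, node.value))

    print_tree(match('add', tokens)[0], indent=4)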

So far we've written a parser capable of correctly parsing binary operations, unary operations, brackets and precedence.

There's only one thing it does incorrectly, and we're going to fix it in the next step.

Step 4: Post Processing

My parser is not perfect in many ways. The important one is that it cannot handle left-recursion, which forced me to write the grammar as right-recursive. As a result, parsing 8/4/2 results in the following AST:

    add
        mul
            atom
                NUM 8
            MUL '/'
            mul
                atom
                    NUM 4
                MUL '/'
                mul
                    atom
                        NUM 2

If we try to solve the expression using this AST, we'll have to calculate 4/2 first, which is wrong. Some LL parsers choose to fix the associativity in the tree. That takes too many lines ;). Instead, we're going to flatten it. The algorithm is simple: for each rule in the AST that 1) needs fixing, 2) is a binary operation (has three sub-rules), and 3) has a right-hand operand of the same rule, flatten the latter into the former. By "flatten", I mean replace a node with its children, in the context of its parent. Since our traversal is DFS post-order, meaning it starts at the leaves of the tree and works its way up to the root, the effect accumulates. Here's the code:

    fix_assoc_rules = 'add', 'mul'

    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]

    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)

This code will turn any structural sequence of additions or multiplications into a flat list (without mixing the two). Parentheses break the sequence, of course, so they won't be affected.
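
For instance, running flatten_right_associativity on the 8/4/2 tree from above turns the nested mul chain into a single flat mul:

    add
        mul
            atom
                NUM 8
            MUL '/'
            atom
                NUM 4
            MUL '/'
            atom
                NUM 2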

From this point I could re-build the structure as left-associative, using code such as

    def build_left_associativity(tree):
        new_nodes = _recurse_tree(tree, build_left_associativity)
        if tree.name in fix_assoc_rules:
            while len(new_nodes)>3:
                new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
        return RuleMatch(tree.name, new_nodes)

But I won't. I'm pressed for lines of code, and changing the evaluation code to handle lists takes far fewer lines than rebuilding the tree.
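
For the record, feeding the flattened 8/4/2 tree through build_left_associativity would nest it back the other way, which is exactly the left-associative structure we wanted:

    add
        mul
            mul
                atom
                    NUM 8
                MUL '/'
                atom
                    NUM 4
            MUL '/'
            atom
                NUM 2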

Step 5: Evaluate

Evaluating the tree is very simple. All that's required is to traverse the tree in a similar fashion to the post-processing code (namely DFS post-order), and to evaluate each rule in it. At the point of evaluation, because we recurse first, each rule should be made of nothing more than numbers and operations. Here's the code:

    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]

    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }

    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)

I wrote calc_binary to evaluate both addition and multiplication (and their counterparts). It evaluates lists of either, in a left-associative fashion, thus bringing our little LL-grammar annoyance to a close. The two cryptic lambdas deserve a word, too: the atom lambda returns its only child when there is one (a number), and the middle child when there are three (the add between LPAR and RPAR); the neg lambda negates its operand only when the operator is '-'.
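
To see that in action, here's how the flattened mul from the 8/4/2 example reduces inside calc_binary (a trace, not extra code):

    # calc_binary receives the already-evaluated children of the flat 'mul':
    x = [8.0, '/', 4.0, '/', 2.0]
    # iteration 1: x[:3] = [div(8.0, 4.0)]  -->  x == [2.0, '/', 2.0]
    # iteration 2: x[:3] = [div(2.0, 2.0)]  -->  x == [1.0]
    # len(x) == 1, so the loop ends and 1.0 == (8/4)/2 is returned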

Step 6: The REPL

The plainest REPL possible:

    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )

Please don't make me explain it 🙂
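
That said, this loop dies on the first typo. If you'd rather it didn't, here's a slightly more forgiving variation (my addition, not part of the 70 lines):

    if __name__ == '__main__':
        while True:
            try:
                print( calc(raw_input('> ')) )
            except EOFError:
                break                       # Ctrl-D exits cleanly
            except Exception as e:          # bad input shouldn't kill the loop
                print('error: %s' % e)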

Appendix: Tying it all together: A calculator in 70 lines

    '''A Calculator Implemented With A Top-Down, Recursive-Descent Parser'''
    # Author: Erez Shinan, Dec 2012
 
    import re, collections
    from operator import add,sub,mul,div
 
    Token = collections.namedtuple('Token', ['name', 'value'])
    RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])
 
    token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
    rule_map = {
        'add' : ['mul ADD add', 'mul'],
        'mul' : ['atom MUL mul', 'atom'],
        'atom': ['NUM', 'LPAR add RPAR', 'neg'],
        'neg' : ['ADD atom'],
    }
    fix_assoc_rules = 'add', 'mul'
 
    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]
 
    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }
 
    def match(rule_name, tokens):
        if tokens and rule_name == tokens[0].name:      # Match a token?
            return tokens[0], tokens[1:]
        for expansion in rule_map.get(rule_name, ()):   # Match a rule?
            remaining_tokens = tokens
            matched_subrules = []
            for subrule in expansion.split():
                matched, remaining_tokens = match(subrule, remaining_tokens)
                if not matched:
                    break   # no such luck. next expansion!
                matched_subrules.append(matched)
            else:
                return RuleMatch(rule_name, matched_subrules), remaining_tokens
        return None, None   # match not found
 
    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]
 
    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)
 
    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)
 
    def calc(expr):
        split_expr = re.findall(r'[\d.]+|[%s]' % ''.join(token_map), expr)
        tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
        tree = match('add', tokens)[0]
        tree = flatten_right_associativity( tree )
        return evaluate(tree)
 
    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )
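
Save it as calc.py, run it with Python 2 (the code uses raw_input and operator.div), and a session should look something like this:

    $ python calc.py
    > 1 + 2*3
    7.0
    > 8/4/2
    1.0
    > -(1 + 2) * 3
    -9.0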

Comments

  • Ivan Savov says:

    Thanks for the cool post. I am not sure I got the difference between the LL and LR parser. Is what you have above an LR parser?

    Also, why did you choose to represent both + and - as "ADD" tokens
    (and * and / as "MUL")? Is this to enforce evaluation priority? It would be interesting to see if you can add ** or ^ as an exponent operator for this calculator.

    Maybe you intended this strictly as an educational post, but I think doing parsing right (and from first principles) is a really cool thing to have. Check out how the khan-exercises framework parses math expressions into an AST:
    https://github.com/Khan/khan-exercises/blob/master/utils/math-model.js

    Peace out,
    Ivan

    • erezsh says:

      Here's my hackernews response, for reference:

      An LR-parser tries to reduce the input over and over again into rules, eventually ending with the 'start' rule. So a+b+c+d becomes [add]+c+d -> [add]+d -> [add] -> [start]

      An LL-parser tries to expand the initial rule into a more complex rule structure, until it matches the input. So to match a+b+c+d it will do [start] -> [add] -> [add] + [num] -> [add] + [num] + [num] -> etc.

      What I wrote is an LL-parser, simply because it's much much simpler to write and to understand.

      Yes, both ADD and MUL are used for precedence. Since any list of +- or of */ will evaluate correctly if reduced from left to right, I didn't mind grouping them together and making my life easier (and shorter).

      It was strictly educational, and also a shtick; a short code hack. If I were to write an actual parser (and I don't think I would ever try to), it would look very different!

  • Josh says:

    Hi,

    A very nice write-up! The calculator isn't really recursive descent since your functions don't implement the production rules of the grammar. Here is a simple calculator I made as an example of recursive descent parsing. Note that each function corresponds to a term in the grammar. I hope the example proves useful to you.

    https://gist.github.com/ascv/5022712

    Cheers,
    Josh

  • Nice article! I'm a Python programmer from Spain, recently interested in domain-specific languages. Would you mind if I translate it to Spanish and publish it on my blog? Thank you for writing it, in any case.
