How To Write A Calculator in 50 Python Lines (Without Eval)

Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression a general-purpose parser. When we're done, we'll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3, and hopefully you'll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to show-case PlyPlus, a parser interface I've been working on-and-off on for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus.

Working knowledge of Python is required for the implementation section.

Grammars

For those of you who don't know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it's made of.

Here an example of a pseudo-grammar for parsing 1 + 2 + 3 + 4:

Rule #1 - add  IS MADE OF  add + number
                       OR  number + number


Or in EBNF:

add: add '+' number
   | number '+' number
   ;

Each pass of the parser will look for either add+number or number+number, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

  1. number + number + number + number
    first pass turns all numbers into a 'number' rule
  2. [number + number] + number + number
    the parser found its first pattern!
  3. [add + number] + number
    after converting the pattern, it finds the next one
  4. [add + number]
  5. add

The sequence of symbols has turned into a hierarchy of two simple rules: number+number and add+number, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.

Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

1 + 2 * 3 / 4 - 5 + 6

Is equivalent to:

1 + (2 * 3 / 4) - 5 + 6

We can express this structure in the grammar by nesting the rules:

add: add + mul
   | mul '+' mul
   ;
mul: mul '*; number
   | number '*' number
   ;

By telling add that it operates on mul, and not on numbers, we are giving multiplications the precedence.
Let's pretend-run this grammar on 1 + 2 * 3 * 4 with our magical parser that is in my head:

  1. number + number * number * number
  2. number + [number * number] * number
    the parser doesn't know what a number+number is, so this is his next pick
  3. number + [mul * number]
  4. number + mul
  5. ???

Now we are in a bit of a pickle! The parser doesn't know what to do with number+mul. We can tell it, but if we keep looking we'll find that there are many possibilities we didn't cover, such as mul+number, add+number, add+add, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

This method might seem strange at first, but it makes total sense:

add: add '+' mul
   | mul '+' mul
   | mul
   ;
mul: mul '*' number
   | number '*' number
   | number
   ;

But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul
   | mul
   ;
mul: mul '*' number
   | number
   ;

Let's pretend-run on 1 + 2 * 3 * 4 again with this new grammar:

  1. number + number * number * number
    There's no rule for number*number now, but the parser can "get creative"

  2. number + [number] * number * number
  3. number + [mul * number] * number
  4. number + [mul * number]
  5. [number] + mul
  6. [mul] + mul
  7. [add + mul]
  8. add

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!

Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let's apply it and write an actual grammar:

start: add;             // This is the top of the hierarchy
add: add add_symbol mul | mul;
mul: mul mul_symbol number | number;
number: '[d.]+';       // Regular expression for a decimal number
mul_symbol: '*' | '/'; // Match * or /
add_symbol: '+' | '-'; // Match + or -

You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straight-forward. Let's run it on an expression!

>>> from plyplus import Grammar
>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      add
        mul
          number
            1
      add_symbol
        +
      mul
        mul
          number
            2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    mul
      number
        5

So pretty!

Study the tree and see what hierarchy the parser chose.

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus and paste the above commands inside python (make sure to put the actual grammar instead of '...' ;-) ).

Shaping the tree

Plyplus automagically creates a tree, but it's not very optimal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have a hierarchy they are just a burden. We can tell Plyplus to "expand" (i.e. remove) rules by prefixing them. A @ will always expand a rule, a # will flatten it, and a ? will expand it if and only if it has one child. In this case, ? is what we want.

start: add;
?add: add add_symbol mul | mul;       // Expand add if it's just a mul
?mul: mul mul_symbol number | number; // Expand mul if it's just a number
number: '[d.]+';
mul_symbol: '*' | '/';
add_symbol: '+' | '-';

Here's how the tree looks with the new grammar:

>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      number
        1
      add_symbol
        +
      mul
        number
          2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    number
      5

Ooh, that is so much cleaner, and I dare say, quite optimal!

Parenthesis and Other Features

We are missing some obvious features: Parenthesis, unary operators (-(1+2)), and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, we'll call atom. Everything inside the atom (namely parenthesis and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we'll make sure it's always expanded, by adding @ to it.

The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul;, but that's tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.

Here's the final grammar, with all of these features:

start: add;
?add: (add add_symbol)? mul;
?mul: (mul mul_symbol)? atom;
@atom: neg | number | '(' add ')';
neg: '-' atom;
number: '[d.]+';
mul_symbol: '*' | '/';
add_symbol: '+' | '-';
WHITESPACE: '[ t]+' (%ignore);

Make sure you understand it, so we can proceed to the next step: Calculating!

Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part we start writing code, so I need to explain two things about the tree.

  1. Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched.
  2. By default, Plyplus removes unnecessary tokens. In our example, the '(' and ')' will already be removed from the tree, as well as neg's '-'. Those of add and mul won't be removed, because they have their own rule, so Plyplus knows they're important. This feature can be turned off to keep all tokens in the tree, but in my experience it's always more elegant to leave it on and change the grammar accordingly.

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It's your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let's see how it's done.

>>> import operator as op
>>> from plyplus import STransformer

class Calc(STransformer):

def _bin_operator(self, exp): arg1, operator_symbol, arg2 = exp.tail

operator_func = { '+': op.add, '-': op.sub, '*': op.mul, '/': op.div }[operator_symbol]

return operator_func(arg1, arg2)

number = lambda self, exp: float(exp.tail[0]) neg = lambda self, exp: -exp.tail[0] __default__ = lambda self, exp: exp.tail[0]

add = _bin_operator mul = _bin_operator

Each method corresponds to a rule name. If a method doesn't exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers, because I'm lazy, but I can implement it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically 'lambda x,y: x+y', etc.

Alright, let's run the transformer and see how it turns out.

>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30'))
31.357142857142858

What does eval() think?

>>> eval('1 + 2 * -(-3+2) / 5.6 + 30')
31.357142857142858

Success!

Final Step: The REPL

For aesthetic reasons, let's wrap it up in a nice calculator REPL:

def main():
    calc = Calc()
    while True:
        try:
            s = raw_input('> ')
        except EOFError:
            break
        if s == '':
            break
        tree = calc_grammar.parse(s)
        print calc.transform(tree)

You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py

Tags: , , , , , ,

Categorised in: ,

8 Comments

  • Pacifier says:

    Fascinating. How is this different from the Lex-Yacc combo, if you have used that? For reference, PLY :http://www.dabeaz.com/ply/ply.html

  • Cliff Dyer says:

    So, it's complaining about the unused WHITESPACE definition in the grammar. How do we put that to use?

    • erezsh says:

      WHITESPACE really isn't used, but it's not a helpful warning. I consider it a minor bug, which I'll fix if people show interest.

      Anyway, it only prints it the first time you compile the grammar.

  • Anonymous says:

    The title is kinda misleading. Its not really 50 lines if you have to import a non standard library.

  • keremispirli says:

    I have a question about parsing

    number + number * number * number

    with the new grammer. As far as I can see, parsing is done from left to right. So, what makes our grammer to skip first "number" while searching for a pattern? Also, after creating first "mul", what makes grammer to skip transforming "mul -> add" and check for "mul + number"?

    What am I missing?

    • erezsh says:

      That what makes a parser different than a simple pattern-matcher.

      Parsers will only match a rule if it leads to a full solution of the text. An error occurs if there are no solutions, or if there is more than one solution (usually). How do they do it? That's complex parser theory, but in simple terms, an LR parser keeps a state, and it will only match a rule if it maintains a 'good' state. A pattern that leads it into a 'bad' state is ignored.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>