Introduction
In this post I will demonstrate how to parse and calculate an arithmetic expression a general-purpose parser. When we're done, we'll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3
, and hopefully you'll have gained the tools to handle much more.
My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to show-case PlyPlus, a parser interface I've been working on-and-off on for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().
If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus
.
Working knowledge of Python is required for the implementation section.
Grammars
For those of you who don't know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it's made of.
Here an example of a pseudo-grammar for parsing 1 + 2 + 3 + 4
:
Rule #1 - add IS MADE OF add + number OR number + number
Or in EBNF:
add: add '+' number | number '+' number ;
Each pass of the parser will look for either add+number or number+number, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.
Here are the steps the parser will take:
- number + number + number + number
first pass turns all numbers into a 'number' rule - [number + number] + number + number
the parser found its first pattern! - [add + number] + number
after converting the pattern, it finds the next one - [add + number]
- add
The sequence of symbols has turned into a hierarchy of two simple rules: number+number
and add+number
, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.
Operator Precedence
Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:
1 + 2 * 3 / 4 - 5 + 6
Is equivalent to:
1 + (2 * 3 / 4) - 5 + 6
We can express this structure in the grammar by nesting the rules:
add: add + mul | mul '+' mul ; mul: mul '*; number | number '*' number ;
By telling add that it operates on mul, and not on numbers, we are giving multiplications the precedence.
Let's pretend-run this grammar on 1 + 2 * 3 * 4
with our magical parser that is in my head:
- number + number * number * number
- number + [number * number] * number
the parser doesn't know what anumber+number
is, so this is his next pick - number + [mul * number]
- number + mul
- ???
Now we are in a bit of a pickle! The parser doesn't know what to do with number+mul. We can tell it, but if we keep looking we'll find that there are many possibilities we didn't cover, such as mul+number, add+number, add+add, etc.
So what do we do?
Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!
This method might seem strange at first, but it makes total sense:
add: add '+' mul | mul '+' mul | mul ; mul: mul '*' number | number '*' number | number ;
But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:
add: add '+' mul | mul ; mul: mul '*' number | number ;
Let's pretend-run on 1 + 2 * 3 * 4
again with this new grammar:
- number + number * number * number
There's no rule fornumber*number
now, but the parser can "get creative" - number + [number] * number * number
- number + [mul * number] * number
- number + [mul * number]
- [number] + mul
- [mul] + mul
- [add + mul]
- add
Success!!!
If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!
Running the parser
By now we have a fairly good idea of how we want our grammar to work. Let's apply it and write an actual grammar:
start: add; // This is the top of the hierarchy add: add add_symbol mul | mul; mul: mul mul_symbol number | number; number: '[d.]+'; // Regular expression for a decimal number mul_symbol: '*' | '/'; // Match * or / add_symbol: '+' | '-'; // Match + or -
You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straight-forward. Let's run it on an expression!
>>> from plyplus import Grammar >>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add add mul number 1 add_symbol + mul mul number 2 mul_symbol * number 3 add_symbol - mul number 5
So pretty!
Study the tree and see what hierarchy the parser chose.
If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus
and paste the above commands inside python (make sure to put the actual grammar instead of '...' 😉 ).
Shaping the tree
Plyplus automagically creates a tree, but it's not very optimal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have a hierarchy they are just a burden. We can tell Plyplus to "expand" (i.e. remove) rules by prefixing them. A @
will always expand a rule, a #
will flatten it, and a ?
will expand it if and only if it has one child. In this case, ?
is what we want.
start: add; ?add: add add_symbol mul | mul; // Expand add if it's just a mul ?mul: mul mul_symbol number | number; // Expand mul if it's just a number number: '[d.]+'; mul_symbol: '*' | '/'; add_symbol: '+' | '-';
Here's how the tree looks with the new grammar:
>>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add number 1 add_symbol + mul number 2 mul_symbol * number 3 add_symbol - number 5
Ooh, that is so much cleaner, and I dare say, quite optimal!
Parenthesis and Other Features
We are missing some obvious features: Parenthesis, unary operators (-(1+2))
, and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.
The important concept is to add a new rule, we'll call atom. Everything inside the atom (namely parenthesis and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we'll make sure it's always expanded, by adding @
to it.
The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul;
, but that's tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.
Here's the final grammar, with all of these features:
start: add; ?add: (add add_symbol)? mul; ?mul: (mul mul_symbol)? atom; @atom: neg | number | '(' add ')'; neg: '-' atom; number: '[d.]+'; mul_symbol: '*' | '/'; add_symbol: '+' | '-'; WHITESPACE: '[ t]+' (%ignore);
Make sure you understand it, so we can proceed to the next step: Calculating!
Calculating!
We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.
This is the part we start writing code, so I need to explain two things about the tree.
- Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched.
- By default, Plyplus removes unnecessary tokens. In our example, the '(' and ')' will already be removed from the tree, as well as neg's '-'. Those of add and mul won't be removed, because they have their own rule, so Plyplus knows they're important. This feature can be turned off to keep all tokens in the tree, but in my experience it's always more elegant to leave it on and change the grammar accordingly.
Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It's your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let's see how it's done.
>>> import operator as op >>> from plyplus import STransformer class Calc(STransformer): def _bin_operator(self, exp): arg1, operator_symbol, arg2 = exp.tail operator_func = { '+': op.add, '-': op.sub, '*': op.mul, '/': op.div }[operator_symbol] return operator_func(arg1, arg2) number = lambda self, exp: float(exp.tail[0]) neg = lambda self, exp: -exp.tail[0] __default__ = lambda self, exp: exp.tail[0] add = _bin_operator mul = _bin_operator
Each method corresponds to a rule name. If a method doesn't exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.
I use float() to parse numbers, because I'm lazy, but I can implement it using the parser as well.
I use the operator module for syntactic beauty. operator.add is basically 'lambda x,y: x+y', etc.
Alright, let's run the transformer and see how it turns out.
>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30')) 31.357142857142858
What does eval() think?
>>> eval('1 + 2 * -(-3+2) / 5.6 + 30') 31.357142857142858
Success!
Final Step: The REPL
For aesthetic reasons, let's wrap it up in a nice calculator REPL:
def main(): calc = Calc() while True: try: s = raw_input('> ') except EOFError: break if s == '': break tree = calc_grammar.parse(s) print calc.transform(tree)
You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py
8 Comments
Fascinating. How is this different from the Lex-Yacc combo, if you have used that? For reference, PLY :http://www.dabeaz.com/ply/ply.html
Thank you. Actually, PlyPlus uses the excellent PLY as its underlying engine.
The difference is in the features PlyPlus offers, such as an advanced grammar, and automatic tree construction.
For comparison, see PLY's calc implementation: http://www.dabeaz.com/ply/example.html
So, it's complaining about the unused WHITESPACE definition in the grammar. How do we put that to use?
WHITESPACE really isn't used, but it's not a helpful warning. I consider it a minor bug, which I'll fix if people show interest.
Anyway, it only prints it the first time you compile the grammar.
The title is kinda misleading. Its not really 50 lines if you have to import a non standard library.
I have a question about parsing
number + number * number * number
with the new grammer. As far as I can see, parsing is done from left to right. So, what makes our grammer to skip first "number" while searching for a pattern? Also, after creating first "mul", what makes grammer to skip transforming "mul -> add" and check for "mul + number"?
What am I missing?
That what makes a parser different than a simple pattern-matcher.
Parsers will only match a rule if it leads to a full solution of the text. An error occurs if there are no solutions, or if there is more than one solution (usually). How do they do it? That's complex parser theory, but in simple terms, an LR parser keeps a state, and it will only match a rule if it maintains a 'good' state. A pattern that leads it into a 'bad' state is ignored.
So I guess that is where PLY comes in. Thanks 🙂