Python Parsing Ep. I: Lexing Python (with ANTLR)

For a while now, I've been writing a code editor for Python. I know there are quite enough of these, but I wanted something a bit different. I decided to write it in C#, and I'm glad I did - C#'s GUI is really easy and fun to write, probably the best GUI-writing experience I've had since VB.

So, at some point of the project I did the horrible mistake of attempting to parse Python. Now, had I been trying to parse C, or maybe Perl, or actually almost any other language (except for C++, which is a parser's worst nightmare) it would've been a piece of cake. But no, I had to try python. And so, it began.

The Parser
Unlike for some other languages, deciding on a parser generator for C# was easy: There's ANTLR, and there's the rest. No competition. (Later I would find out about Grammatica, which is pretty good. More on that some other time).

ANTLR's website provides a variety of grammars for the lazy programmer, and among them a grammar for Python. Thinking that I evaded real work, I tried to compile it:

F:antlr_play> Antlr3.Tool.exe Python.g
ANTLR Parser Generator  Version 3.0 (May 17, 2007)  1989-2007
warning(105): Python.g:289:17: no lexer rule corresponding to
                               token: INDENT
warning(105): Python.g:289:32: no lexer rule corresponding to
                               token: DEDENT
... (some more warnings about how the grammar is not
     deteministic) ...

Ouch. Looks like someone did not implement INDENT and DEDENT in the lexer.
Turns out that this grammar was written for Java, which has some kind of a special pre-processor (or post?) for Python (called PythonTokenStream.java) to produce the right indentation tokens.

C# does not have one.

Googling produced no results that could help me, and so I set out to write the lexer grammar myself.

Lexing Python
Now, a few things about lexing python.

Python's indentation cannot be solved with a DFA. (I'm still perplexed at whether it can even be solved with a context-free grammar).
PyPy produced an interesting post about lexing Python (they intend to solve it using post-processing the lexer output)
CPython's tokenizer is written in C. It's ad-hoc, hand-written, and complex. It is the only official implementation of Python lexing that I know of.

But the best part is that I wasn't aware of any of these points when I started writing my lexer grammar.

What I did find was the state-machine algorithm for the indentation, found in Python's official documentation:

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

Or in python pseudo code:

stack = [0]
for line in code_lines:
  indent_level = get_indent_level( line )
  if stack[-1] &lt; indent_level: 
    stack.append( indent_level ) 
    yield INDENT 
  else:
    while stack[-1] &gt; indent_level:
      stack.pop()
    if stack[-1] &lt; indent_level:
      raise IndentationError()
    yield DEDENT

  process_line ( line )

for dummy in stack[1:]:
  yield DEDENT

Now, this is clean and pretty. That is because it's a post-processor, and is not part of the lexer. But when you deal with indentation while tokenizing, you have to take into account python's quirks: line joinings, both explicit (backslash) and implicit (parenthesis and friends). Also, indents and dedents are not counted for empty lines, so you have to take whitespace into account only if it is followed by text (using lookahead). And of course the whitespace must be thrown away after processing it (or your parser grammar will be awful).

So what person in his right mind wouldn't do it as post-processing? Well, me, and anyone else who doesn't know better. But, despite the complications, I managed to get the grammar to work, and I would like to share it with you. Only four disclaimers:

It's written for C#. But with very little tweaking it should work for Java just the same (not like you need it).
It's kinda ugly. Admittedly, it's not my best piece of code. But ugly code for an ugly task. Anyway, at least it's working. Oh, see 3.
I've only tested it on few python files, and it's not impossible that it would fail on some eccentric piece of code.
I had to paste it because wordpress doesn't seem to allow uploading non-media files, and it doesn't fit the width of the page. Just copy-paste it if you want to read it properly.

With that in mind, here's the relevant part changed in the lexer grammar (again, on top of this grammar):


/** Treat a sequence of blank lines as a single blank line. If
* nested within a (..), {..}, or [..], then ignore newlines.
* If the first newline starts in column one, they are to be ignored.
*
* Frank Wierzbicki added: Also ignore FORMFEEDS (u000C).
*/
NEWLINE
@init {
int spaces = 0;
}
: ((('u000C')?('r')? 'n' ) | 't' | ' ' )* (('u000C')?('r')? 'n' )
leading_space = (' ' { spaces++; } | 't' { spaces += 8; spaces -= (spaces % 8); } )*
{
if ( tokenStartCharPositionInLine==0 || implicitLineJoiningLevel&gt;0 )
//$channel=HIDDEN;
Emit(new ClassicToken(NEWLINE,this.Text, HIDDEN));
else {
Emit(new ClassicToken(NEWLINE,this.Text));
//newLine = true;
}

if (implicitLineJoiningLevel==0)
{
if ( spaces &gt; IndentStack[IndentStack.Count - 1] ) {
IndentStack.Add(spaces);
Emit(new ClassicToken(INDENT,"&gt;"));

}
else
{
while (spaces &lt; IndentStack[IndentStack.Count - 1]) {
IndentStack.RemoveAt(IndentStack.Count - 1);
Emit(new ClassicToken(DEDENT,"&lt;")); } } } } ; WS : {tokenStartCharPositionInLine&gt;0}?=&gt; (' '|'t'|'u000C')+ {$channel=HIDDEN;}
;

/** Grab everything before a real symbol. Then if newline, kill it
* as this is a blank line. If whitespace followed by comment, kill it
* as it's a comment on a line by itself.
*
* Ignore leading whitespace when nested in [..], (..), {..}.
*/
LEADING_WS
@init {
int spaces = 0;
}
: {tokenStartCharPositionInLine==0}?=&gt;
( {implicitLineJoiningLevel&gt;0}? ( ' ' | 't' )+ {$channel=HIDDEN;}
| ( ' ' { spaces++; }
| 't' { spaces += 8; spaces -= (spaces % 8); }
)+
{
// make a string of n spaces where n is column number - 1
char[] indentation = new char[spaces];
for (int i=0; i&lt;spaces; i++) { indentation[i] = ' '; } String s = new String(indentation); Emit(new ClassicToken(LEADING_WS,new String(indentation))); } // kill trailing newline if present and then ignore ( ('r')? 'n' {if (token!=null) token.Channel = HIDDEN; else $channel=HIDDEN;})* // {token.Channel = 99; } ) ; /** Comments not on line by themselves are turned into newlines. b = a # end of line comment or a = [1, # weird 2] This rule is invoked directly by nextToken when the comment is in first column or when comment is on end of nonwhitespace line. Only match n here if we didn't start on left edge; let NEWLINE return that. Kill if newlines if we live on a line by ourselves Consume any leading whitespace if it starts on left edge. */ COMMENT @init { $channel=HIDDEN; } : {tokenStartCharPositionInLine==0}?=&gt; (' '|'t')* '#' (~'n')* 'n'+
| {tokenStartCharPositionInLine&gt;0}?=&gt; '#' (~'n')* // let NEWLINE handle n unless char pos==0 for '#'
;

So, this is it so far. I hope it was useful.
The next part of the saga will discuss ANTLR specifically, explain why using it to parse python can also be complicated sometimes, and of course provide a solution.

4 Comments

Frank Wierzbicki says:

Jul 14/2008, 20:13

Interesting work! By the way, I've fixed a few more small bugs in the Python.g parser, but only in the parser that is actually part of a development branch of Jython. At some point I'll put these together and submit them again to http://www.antlr.org (since the one for Jython actually produces an AST -- which is not what you necessarily want in an example grammar)

eliben says:

Jul 17/2008, 15:54

Parsing Python is much simpler than parsing C, so don't feel so bad.

erezsh says:

Jul 17/2008, 16:08

eliben, you can't possibly be serious.

4 Comments

Leave a Reply Cancel reply