Python Parsing Ep. II: Antlr Issues

Continuing that post.

Before I go on, I wish to make 2 corrections to the code provided in the last post:

  1. I think that the grammar for COMMENT is incorrect (in the original too). It requires comments to always end with a new-line. I changed that requirement to be optional, and now it seems more agreeable with end-of-files. For your consideration.
  2. The code for INDENT/DEDENT cannot work because ANTLR's Emit cannot be called twice from the same symbol: Only the last call registers. That is fixable by overriding Emit and NextToken (I found this code somewhere on the internet, cannot remember where). And so, here is my code for the lexer:
    @lexer::members {
    int implicitLineJoiningLevel = 0;
    int startPos=0;
    System.Collections.Generic.List<int> IndentStack = new System.Collections.Generic.List<int>() { 0 };
    
    System.Collections.Generic.List<IToken> tokens = new System.Collections.Generic.List<IToken>();
    public override void Emit(IToken token) {
            this.token = token;
    		token.Line = this.tokenStartLine;
    		token.CharPositionInLine = this.tokenStartCharPositionInLine;
        	tokens.Add(token);
    }
    public override IToken NextToken() {
        	base.NextToken();
            if ( tokens.Count==0 ) {
                return Token.EOF_TOKEN;
            }
    		Token t = (Token)tokens[0];
    		tokens.RemoveAt(0);
            return t;
    }
    
    }
    
    

Now that we're done with the lexing...

We have a lexer working, which inputs python code and outputs tokens. So now all we have to do is feed the formal grammar into ANTLR and our work is done, right?

Well, ANTLR's tree API is very unfriendly; I spent hours figuring it out and I still don't really get it. And since the parser just throws a sequence of tokens at you, I had no obvious way of constructing some hierarchy out of the code.
But even if I could somehow untangle the mysteries of ANTLR's trees, I had a bigger problem which is only relevant if you want to display the code (which I did): The output from the parser contains no regard for whitespace or comments that were originally there. I had to find the symbol's start and end in the original text, and copy the text from there. But symbols didn't end where they "ought to". Since comments were removed before reaching the parser, symbols didn't include them even in their position marker, so:

class A:
    hello = "world"
    # just some class

... would leave the comment out of the scope of the class, which is bad. It would be a lot easier to just write an interpreter...

But I conjured a solution to both of my problems at once: I will insert markers -- fake symbols -- at the start and end of all definitions. I took the chance to insert other markers which would ease my life. So my class definition looked something like that:

classdef:
	{ParserEmit($classdef.start, $classdef.stop, ParserToken.TokenType.ClassBegin, "");}
	CLASS classname
	{ParserEmit($classname.start, $classname.stop, ParserToken.TokenType.Name, $classname.text);}
	(LPAREN testlist? RPAREN
		{ParserEmit($testlist.start, $testlist.stop, ParserToken.TokenType.Args, $testlist.text);}
	)? colon
	(suite {ParserEmit($suite.start, $suite.stop, ParserToken.TokenType.ClassEnd, "");}
	| NEWLINE
		{ParserEmit($colon.start, $colon.stop, ParserToken.TokenType.Error, "Empty Class Def");
		 ParserEmit($colon.start, $colon.stop, ParserToken.TokenType.ClassEnd, "");
		}
	)
	;

(If you are familiar with ANTLR you'll notice that I'm using the 'Emit' function "incorrectly". I will explain soon enough)

Observant readers might notice that I only delayed the problem - the resulting stream of fake symbols are themselves a production of a formal grammar, and need to be parsed in order to construct hierarchy. True, but the result is a lot simpler than Python code; I am now dealing with symbols (=classes) and not with text (=strings).
However, this is not my only problem. As I said earlier, 'Emit' is only designed to be called once for each symbol. I didn't like Emit very much anyway, and so I created my own emit function (I called it 'ParserEmit'). It simply saves everything it receives in a list, which I can later iterate over in my code. Just for the record, here's how it looks:

public System.Collections.Generic.List<ParserToken> parser_toks =
	new System.Collections.Generic.List<ParserToken>();

void ParserEmit(IToken start, IToken stop, ParserToken.TokenType t, string text)
{
	ParserToken tok = new ParserToken() { type=t, start=start, stop=stop, text=text };
	parser_toks.Add(tok);
}

(ParserToken is a class I made up. You can figure out what it looks like)

Now I had a list of symbols telling me exactly where each class/function starts/ends.
Next post will be Ep. 3 and last, where I will describe how I turned those symbols into a tree, rant about text-obsession in parsers, and provide my complete ANTLR grammar for Python (which just has too many changes/additions to describe each individually).

Tags: , , ,

Categorised in:

Leave a Reply

Your email address will not be published. Required fields are marked *