Create a stand-alone LALR(1) parser in Python

Over the years, I noticed that many developers are reluctant to use parsing libraries, especially if the language they need to parse is relatively small. The reason is that they wish to avoid adding external dependencies to their project. Although pip (setuptools) can automatically fetch whatever is required, external dependencies makes zip releases harder, and it even seems to discourage user adoption for some reason! This tendency is quite strong, and can create strange effects. For example, Xonsh, a unix shell written in Python, gets around it by including the entire(!) PLY source code, which adds thousands of lines of code, spanning over many files.

So, it seems to me that there is real value in being able to generate a small stand-alone parser. And that is why I spent the past week adding this feature to Lark, my Pythonic parsing library. The resulting parser is much smaller than Lark itself, it loads much faster because the grammar is pre-compiled, and it's just as easy to use.

I'll demonstrate how to generate a standalone parser with Lark and how to use it, using the example of a naive JSON parser.

Step 1) Define the grammar

We define the grammar using EBNF, embellished with a few of Lark's special features.

To see a detailed explanation of the grammar, go to Lark's JSON Parser tutorial.

json.g

?start: value

?value: object
        | array
        | string
        | SIGNED_NUMBER      -> number
        | "true"             -> true
        | "false"            -> false
        | "null"             -> null

array  : "[" [value ("," value)*] "]"
object : "{" [pair ("," pair)*] "}"
pair   : string ":" value

string : ESCAPED_STRING

%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS

%ignore WS

Step 2) Generate the parser

python -m lark.tools.standalone json.g > json_parser.py

At this point, we already have a working parser, that can generate a parse-tree! For example, here's how we can use it:

>>> from json_parser import Lark_StandAlone
>>> parser = Lark_StandAlone()
>>> tree = parser.parse('{"key": ["string", 3.14]}')
>>> print(tree.pretty())
object
  pair
    string      "key"
    array
      string    "string"
      number    3.14

Sometimes that's enough, but we want the JSON parser to create native Python objects. So the next step is to transform the tree.

Step 3) Write a transformer

A transformer provides callbacks (or handlers) for each rule in the tree. In this case, we want to convert object to dict, array to list, number to float, etc.

Once again, this is covered in detail in the JSON Parser tutorial.

from json_parser import Lark_StandAlone, Transformer, inline_args

class TreeToJson(Transformer):
    @inline_args
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict
    number = inline_args(float)

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False


parser = Lark_StandAlone(transformer=TreeToJson())

That's it! We have a working JSON parser. Here's what happens when we try the former example:

>>> parser.parse('{"key": ["string", 3.14]}')
{'key': ['string', 3.14]}

You can see the entire standalone example (including the generated parser, which is ~800 loc) here:

https://github.com/erezsh/lark/blob/master/examples/standalone


Categorised in: ,

3 Comments

  • gercino jr says:

    yeaaaaaahhhhhhhhhhhhhh! Thank you very much!!

  • Ali says:

    Hi, it seems that custom lexer can not be defined while using standalone parser. Why? Is there any workaround?

    Error:
    Some options are not allowed when loading a Parser: { 'lexer' }

    • erezsh says:

      The standalone parser is configured when generating it. Changing its configuration afterwards isn't possible. (except for a few settings that allow it)

      Why do you need to switch between lexers in the standalone?

Leave a Reply

Your email address will not be published. Required fields are marked *