If you've ever used Lark, you probably already know that it's rather featureful compared to other parsing libraries. Even when it was first released, five years ago, it already had two parsing algorithms, several lexers, automatic AST construction, automatic line counting, and the list goes on and on.
What more could anyone want from a parser?
Apparently, quite a bit!
Over the years, Lark gained many useful and powerful features, that most of our users probably aren't even aware of.
For the rest of the article, I will try to amend that, by giving a short overview of the five features that I think are most noteworthy, one for each year of Lark's existence.
Table of contents:
Ready? Set. Let's go!
1. Grammar Composition
Grammar composition means safely importing rules, or even whole grammars, into other grammars.
There are several possible uses for this:
-
Library of terminals, rules, and rule templates: The library may consist of utilities or common patterns, that are used by different grammars.
-
Embedding DSLs, or combining languages: You could parse the combined languages in one pass, and let the parser worry about possible collisions.
-
Refining variants: Most languages have different versions and dialects. Grammar composition lets you define a common base, for the variants to inherit and extend or override.
So, how much of this can Lark do? Well, Lark can do all of it! But, with two important caveats, which I’ll get to in a moment.
First, let me explain how it works.
Importing rules
Lark has a sophisticated import system, that lets users import terminals, rules, and rule templates from other grammars.
It’s facilitated by the %import
directive:
%import .path.to.grammar (TERM1, TERM2, rule1, rule2, template1, template2) |
You can import from a relative path, or provide Lark with a list of search paths.
When importing rules (or templates), Lark will make sure to also import all of their dependencies for you. But it will import those dependencies into a separate namespace, so there’s no need to worry about name collisions.
Additionally, those imported rules can be extended or overridden, essentially implementing an inheritance model.
You can do that using the %override
and %extend
directives:
% override some_rule: new definition % extend some_other_rule: additional definition |
Personally, I’ve used it to implement different variants of SQL, with a shared base grammar, and a minimal amount of overrides and extends per dialect.
Caveats
So, what’s the catch?
The first caveat is that grammar composition is only guaranteed to work if you use Earley with the option lexer="dynamic_complete"
, because it's the only configuration that guarantees to parse *all* CFGs (Context-Free Grammars) without reservations. Using other lexers with Earley may cause incorrect parsing due to tokenization conflicts. Composition will also work with LALR, but there is a likelihood of it causing Reduce conflicts, or tokenization conflicts. However, with some care, in many cases those conflicts can be solved.
The second caveat is that the %ignore
directive can’t be imported, because it’s a global directive, and isn’t localized to specific rules. That means that in the root grammar, users have to re-declare %ignore
(if applicable) and make sure that it covers all the imported rules correctly. On the bright side, %ignore
isn’t necessary when using Earley, and can be avoided. Also, if there’s enough popular demand, we might one day implement localized ignores, which will fix the problem.
Would you like to know more?
See this example of using %extend
to add a match
statement to the Python grammar.
Also, Lark comes with a complete example of grammar composition, which shows how to use merge_transformers()
in order to combine transformers of different DSLs, while also dealing with namespaced symbols.
2. Interactive Parser
With the InteractiveParser
class, Lark introduces a completely new way to interact with your parser. The traditional design of parsers dictates that they run as a closed loop. You start them, and they run as independent actors, calling your callbacks as they see fit. Then the loop ends, and the parser exits. A parser isn’t supposed to be paused. Like the abyss, it isn’t meant to be peered into. But what if we could?
The interactive parser is a living object that you act upon. You can feed it tokens one by one, observe its state, and even make fully-functional copies of it. It gives you complete control to define your own parsing logic. For example, you could use it to add context-sensitive logic, or to implement backtracking (by making copies at checkpoints), or to debug your parser from the Python console. It also makes error handling a LOT easier.
One caveat, is that the interactive parser currently only works for the LALR(1) algorithm. (i.e. not with Earley)
Implement your own parsing logic
To refresh your memory, the first step to creating a parser is to create an instance of Lark:
from lark import Lark parser = Lark(my_grammar, parser = "lalr" ) |
Then, we would usually write something like parser.parse(text)
, to get the parse tree. But instead, we’re going to get an interactive parser object:
>>> parser.parse_interactive(my_text) <lark.parsers.lalr_interactive_parser.InteractiveParser object at ...> |
The InteractiveParser
class has a few notable methods:
-
iterate_parse()
- Returns an iterator that advances the parser state by state, yielding the matched tokens as they come. -
choices()
- Returns a dict of the acceptable token types in the current state. -
feed_token(token)
- Feeds the parser with aToken
instance -
resume_parse()
- Resume automated parsing from the current state. -
copy()
- Creates an independent copy of theInteractiveParser
instance. The two instances won’t affect each other.
Here’s a quick example of how you might use it. I recommend that you paste it into Python, section by section, so you can see what each part does.
from lark import Lark, Token # Create parser for a*b* language ab_grammar = '!start: "a"* "b"*' parser = Lark(ab_grammar, parser = "lalr" ) # Create an interactive parser on a specific input ip = parser.parse_interactive( "aaabb" ) # Loop through the tokens and states, and print them for token in ip.iterate_parse(): print ( "Token: " , token) print (ip.pretty(), '\n' ) print ( "Final state:" , ip.pretty()) # Feed a new 'b' that wasn't in the input ip.feed_token(Token( 'B' , 'b' )) # Print the resulting AST. # Because we pushed a 'b', he result will contain 3 'a's and 3 'b's. tree = ip.resume_parse() print (tree.pretty()) |
Easy error-handling
When Lark encounters an error while parsing, its default behavior is to throw an exception. But the parse()
method accepts an on_error
callback, that, if provided, will get called instead. The on_error
callback will be called with an InteractiveParser
instance, initialized to the state after the error occured. The callback can then manipulate the parse state, if necessary, and either return True
to ignore the error and resume parsing, or False
to raise the exception.
Here’s a little recipe to debug your parser from the inside, when it throws an error:
def debug_parser(ip: InteractiveParser): fixed = False print (ip.pretty()) # Print the current parser state breakpoint() # Take a look around and try to fix the error return fixed # Resume running if fixed, else raise the exception parser.parse(my_text, on_error = debug_parser) |
And here’s an example of using on_error
for error-handling bad input: Error handling using an interactive parser
Would you like to know more?
Read the docs about InteractiveParser
and on_error
.
3. Ports
While Lark’s reference implementation is in Python, the Lark ecosystem now boasts two ports of the parser, to Julia and to Javascript. The ports implement a subset of Lark, and use the same grammar, interface, and idioms, as Lark-Python.
That means you can write your grammar once, and use it in three different languages. All you need to do is translate your transformers.
Lerche (Julia)
Lerche reimplements in Julia everything that is in Lark’s version 0.11.1, except for the Earley parser, and a few minor features.
Translating between Python to Julia is relatively easy, as the languages are quite similar in terms of syntax and style.
Example code of using Lerche:
# Define transformer struct TreeToJson <: Transformer end @inline_rule string(t::TreeToJson, s) = replace(s[ 2 :end - 1 ], "\\\"" = > "\"" ) @rule array(t::TreeToJson, a) = Array(a) @rule pair(t::TreeToJson, p) = Tuple (p) ... # Create parser json_parser = Lark(json_grammar, parser = "lalr" , lexer = "standard" , transformer = TreeToJson()) # Parse text j = Lerche.parse(json_parser, test_json) |
Lark.js (Javascript)
Lark.js is a live port of Lark’s standalone generator to Javascript. The generator is still written in Python, but it generates a standalone Javascript parser. (instead of a standalone Python parser.)
That means you need to generate your parsers using Python, and can’t create parsers in the browser on-the-fly. But maybe in the future we’ll add that too. For now, there’s a webpack plugin (WIP) that auto-generates the parser for you, when you import a Lark grammar, and re-generates it whenever the grammar file changes.
Example JSON parser:
// import {get_parser} from './json_parser.js'; // <-- file generated by running Lark.js manually import {get_parser} from './json.lark' ; // <-- uses the webpack plugin // Define transformer let transformer = { number: ([n]) => parseFloat(n.value), string: ([s]) => s.value.slice(1, -1), array: Array.from, pair: Array.from, object: Object.fromEntries, null : () => null , true : () => true , false : () => false , } // Create parser const parser = get_parser({transformer}) // Parse text console.log( parser.parse(text) ) |
Live Port
When you run Lark.js, it serializes your grammar, and applies it to the Javascript parser template. The template itself is Lark code that was automatically transpiled from Python to Javascript, using a home-made transpiler that I wrote. (the transpiler itself isn’t available yet). By being transpiled, Lark.js won’t lag behind the Python version, even for major updates.
Of course, transpiled code has its own disadvantages, in terms of size, speed and readability. While certainly the generated code can be improved, it’s already in reasonable shape.
For example, the following code:
if not hasattr ( self , 'lexer' ) or dont_ignore: lexer = self ._build_lexer(dont_ignore) else : lexer = self .lexer |
Gets automatically translated to:
let lexer; if (!( "lexer" in this ) || dont_ignore) { lexer = this ._build_lexer(dont_ignore); } else { lexer = this .lexer; } |
4. Reconstructor
The reconstructor is a utility that reconstructs text from an AST. In other words, given a grammar and an AST, it produces text that would parse into that AST.
That’s more tricky than it might sound at first. Here’s why.
Parsers usually produce something called a Concrete Syntax Tree (CST), which is a tree that includes all the parsed branches and tokens. It’s easy to reconstruct the text from a CST - just join all the strings in order.
But CSTs are hard to read and use. And they are even harder to change buglessly.
That’s why Lark is built to produce ASTs (or something very close to one), by omitting punctuation, and by providing all kinds of operators for shaping the parse-tree. In the humble opinion of the author, it’s one of Lark’s best features.
The resulting AST is easy to read and change, but how can you turn it back into a CST? There is often little resemblance between the tree branch (that the parser created) and the original rule that the user wrote.
That’s exactly what the reconstructor does. It observes the grammar and your AST, converts it back to a CST by automatically inserting branches and tokens in all the right places, and finally converts the CST to text.
The reconstructor needs to know how to “reverse” several different operations, including missing tokens, inlined rules (_inlined_rule
and ?maybe_inlined
), and aliases (the ->
operator) which rename the branch.
This task, of matching each AST branch to the correct grammar rule, is a difficult task that contains ambiguity and recursion. The reconstructor employs Lark's own Earley parser, and runs it on a sort-of “reversed grammar” of reconstruction rules, in order to match them efficiently. The implementation itself is hard to grok, but it's pretty short, spanning only 300 LOCs.
How to use?
The reconstructor accepts a Lark instance and a parse tree, and produces a string, which is a possible input that would produce that tree.
Here’s an example for reconstructing JSON, given a grammar: (you can find the grammar in the examples folder)
json_grammar = "..." text_json_text = "..." # Initialize the parser. Must disable maybe_placeholders for the reconstructor json_parser = Lark(json_grammar, maybe_placeholders = False ) # Create an AST tree = json_parser.parse(test_json_text) # Reconstruct JSON from the AST! r = Reconstructor(json_parser) new_json = r.reconstruct(tree) # The text might not be exactly similar, due to whitespace, but the contents are assert json.loads(new_json) = = json.loads(test_json) |
Would you like to know more?
See Lark’s example for parsing & reconstructing Python code
Another toy example shows how to use the reconstructor with tree templates, to convert Python 3 code to Python 2.
The reconstructor is used by the commentjson package, allowing it to be incredibly tiny, for what it does.
5. Lark-Cython
Lark-Cython is a newly released Lark plugin that can make your LALR parser run twice as fast, by only adding two lines of code.
I’ve simply reimplemented the lexer and parser in Cython, which is basically just Python with extra type definitions.
The x2 speed up is actually very modest compared to its potential. As the codebase improves and gets optimized, it’s possible that future versions will be much much faster.
Install:
pip install lark-cython |
Usage:
import lark_cython parser = Lark(grammar, parser = "lalr" , _plugins = lark_cython.plugins) # Use Lark as you usually would, with a huge performance boost |
To learn more and start using it, visit Lark-Cython
Honorable mentions
While these features don’t deserve a whole section, it’s still worth knowing about them!
-
Lark has an online IDE, that you can use to try grammars in a friendly environment. It’s written in Svelte, and contributions are welcome!
-
Lark provides a utility to automatically convert your
Tree
AST into a class-based AST. See an example.
3 Comments
Lark is amazing and I've learned a lot here---thank you!