Compiler Construction/A recipe for writing a reusable grammar

So you need to write a grammar. There are several important practical concerns that you will have to deal with:

You need your parser to be usable from your language.
You may also want it to be reusable from other languages.
The language you use may have only a very immature parser generator that is impossible to use for anything complex:
- It may completely lack debug tooling.
- It may return very weird error messages even though your grammar looks OK.
- Even if everything compiles fine, the generated parser may crash at runtime.
The parsers generated by convenient parser generators may be damn slow.

So here is how we address these issues. The general approach is simple:

do everything step-by-step;
use the right tools.

1. Create enough **simple** samples of the language you want to parse.

2. Look at them. Ask yourself what the minimum grammar class for your language is. Is it possible to write a context-free grammar? If not, is it possible to reduce your grammar to a context-free one and resolve the non-context-free parts after the parse tree has been built?
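
As a minimal sketch of resolving a non-context-free constraint after parsing, assume a hypothetical markup in which a closing tag must repeat the name of its opening tag. A context-free grammar can accept any name in both positions, and a separate pass over the finished tree checks that they agree. The tuple-based tree shape and the function name `check_tags` are assumptions made for illustration only.

```python
# Hypothetical parse tree node: (open_name, close_name, children).
# A context-free grammar can build this without comparing the two names.

def check_tags(node, path="root"):
    """Return a list of error messages for mismatched tag names."""
    open_name, close_name, children = node
    errors = []
    if open_name != close_name:
        errors.append(f"{path}: <{open_name}> closed by </{close_name}>")
    for i, child in enumerate(children):
        errors.extend(check_tags(child, f"{path}/{open_name}[{i}]"))
    return errors

good = ("a", "a", [("b", "b", [])])
bad = ("a", "a", [("b", "c", [])])
print(check_tags(good))
print(check_tags(bad))
```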

3. Select the language you want to target. In my case it was Python.

4. Find parser generators for it. For each generator, find out the following:

the class of grammars it can handle;
whether it is tokenizerless (lexerless) or tokenizerful;
whether it supports modular grammars, where each piece can be developed and tested separately;
availability of debugging and visualization tools:
- if the parser is tokenizerful, it should be possible to print the tokens before parsing;
- tracer;
- tree visualizer;
availability of libraries of precreated grammars;
availability of good documentation and examples;
is the grammar compiled into target-language code that does the parsing directly, or into a data structure interpreted by a runtime?
even if it is transformed into code, the code may still use some runtime. How efficient is that runtime, and what dependencies does it have?
which other programming languages does the parser generator support?

5. Search the Internet for existing grammars. The grammar may already exist, in which case you only need to adapt it to your parser generator.

6. Select your initial parser generator.

It must be GLR. You can lower the class and optimize later; for now you need to write the grammar at a high level and make it work.
It should be tokenizerless.
It must have tools allowing you to trace its execution.
It must return a parse tree.
It must be possible to visualize the parse tree.
It must support naming tokens and skipping irrelevant tokens.

I used parglare in GLR mode. parglare has a debugging tool that let me see the transition graph, the path taken through it, the other possible paths, and where an error occurred.

7. Select other parser generators. The criteria are the same, but they may be of another class. Your final goal is to make your grammar compatible with all of them and to wrap the resulting parse tree in an abstraction layer, so that your handwritten code deals only with the abstraction layer. Switching to a parser generator of the same or a higher class is usually easy: just determine the mapping from the syntax features of one grammar DSL to those of the other.

Downgrading a grammar class may require changing both the parser generator and the grammar. It is very important to keep the grammars for the different generators in sync so that they have the same structure. Your goal is to write the grammar for one parser generator in such a way that it can be automatically transformed for the others. Keeping them in sync may make the grammar for a higher, or even the same, class unnecessarily ugly: different parser generators offer different syntactic sugar, so you have to use the common subset. It doesn't matter that the grammar gets a bit uglier. You can postprocess the tree after parsing to make it more convenient to work with. What matters is what you gain when you downgrade the grammar class:

better performance;
more understandable error messages;
a larger set of parser generators that can be used with your grammar (an LL(1) grammar can be used in PEG and LR(1) parser generators, but an LR(1) grammar cannot be used in an LL(1) parser generator, and a GLR grammar cannot be used in an LR(1) parser generator):
- you increase the set of debug tools you can use when developing the grammar further;
- your grammar becomes more useful, because more people can reuse it with the parser generators available for their languages;
- you restrict your grammar to the features common to all parser generators, so it is easier to port further.
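
The compatibility relations between grammar classes can be written down as data and queried. The sketch below is only an illustration: the table `ACCEPTED_BY` encodes the relations stated in the text plus two standard inclusions (every LL(1) grammar is also LL(*), and every LR(1) grammar is accepted by a GLR parser). It is not an exhaustive classification, and the function and variable names are made up for this example.

```python
# For a grammar written for a given class: which generator classes accept it.
# Deliberately conservative; relations not listed are treated as "unknown/no".
ACCEPTED_BY = {
    "LL(1)": {"LL(1)", "LL(*)", "PEG", "LR(1)", "GLR"},
    "LL(*)": {"LL(*)"},
    "PEG": {"PEG"},
    "LR(1)": {"LR(1)", "GLR"},
    "GLR": {"GLR"},
}

# The tools mentioned in this recipe and the class each one was used for.
TOOLS = {
    "parglare (GLR mode)": "GLR",
    "parglare (LR mode)": "LR(1)",
    "waxeye": "PEG",
    "ANTLR": "LL(*)",
    "pyCoCo": "LL(1)",
}

def usable_generators(grammar_class, tools=TOOLS):
    """Names of the tools that can take a grammar of the given class."""
    accepted = ACCEPTED_BY[grammar_class]
    return sorted(name for name, cls in tools.items() if cls in accepted)

print(usable_generators("GLR"))
print(usable_generators("LR(1)"))
print(usable_generators("LL(1)"))
```

Each downgrade step widens the list of usable tools, which is the point of the exercise.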

8. Write a high-level GLR grammar.

Resolve all the errors and warnings emitted by the generator.
Debug it. Use tracing and parse tree visualization.
- Make it work somehow on the simplest example.
- Then make it work on all the examples.
- Then make it return the parse tree you want on all the examples.
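
Running every sample through the parser after each grammar change is easy to automate. The following is a minimal sketch that assumes the parser is available as a callable raising an exception on failure. `toy_parse` is a stand-in written for this example, not the API of any real parser generator.

```python
def run_samples(parse, samples):
    """Parse every sample; return {sample name: error text} for the failures."""
    failures = {}
    for name, text in samples.items():
        try:
            parse(text)
        except Exception as exc:  # a real backend raises its own error type
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures

# Stand-in parser for illustration only: accepts lines of the form key=value.
def toy_parse(text):
    pairs = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or not key:
            raise ValueError(f"bad line: {line!r}")
        pairs.append((key, value))
    return pairs

SAMPLES = {
    "simplest": "a=1",
    "two_lines": "a=1\nb=2",
    "broken": "a=1\nnonsense",
}
print(run_samples(toy_parse, SAMPLES))
```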

9. Optimize the grammar to make it look clean. Separate tokens from productions. Debug and test it.

9. Deoptimize the grammar:

if the tokenizer uses regexps, replace the regexps with tokens and productions. Most generators don't support regexps as tokens. This may degrade performance (regexps are executed in native code, whereas the parser is executed in Python), but universality is more valuable here;
move groups into separate tokens/productions. Rationale: some generators don't support groups;
separate character classes from the tokens built from them. Rationale: some generators require character classes and tokens to be defined separately.
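
To illustrate the first and third points, the sketch below takes a hypothetical identifier token defined by a regexp and expresses the same thing as two character classes combined the way a production would combine them (`identifier: identStart identRest*`). The token and all names are assumptions chosen for the example. The loop at the end checks that both versions agree on a few inputs.

```python
import re
import string

# Regexp form of the token.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Character classes, defined separately from the token that uses them.
IDENT_START = set(string.ascii_letters + "_")
IDENT_REST = IDENT_START | set(string.digits)

def match_identifier(text, pos=0):
    """identifier: identStart identRest* ; returns the end position or None."""
    if pos >= len(text) or text[pos] not in IDENT_START:
        return None
    pos += 1
    while pos < len(text) and text[pos] in IDENT_REST:
        pos += 1
    return pos

def regex_end(text):
    """End position of the regexp match at the start of text, or None."""
    m = IDENT_RE.match(text)
    return m.end() if m else None

for sample in ["foo_1 bar", "_x", "9lives", ""]:
    print(repr(sample), match_identifier(sample), regex_end(sample))
```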

10. Start writing a high-level wrapper for your parser generator. Create an abstract class/trait/interface representing a wrapper for your parser. You will usually need the following abstract methods:

to parse a string/file into a parse tree;
to get named parts of the parse tree by their names;
to walk repeated tokens;
to transform a subtree into a string by recursively walking it. This matters later: downgrading the class of your grammar splits what used to be one token into several.

Implement the interface. Write the application code against the interface so that it does not deal with low-level details. Debug it. Build a minimal testable application around your parser.
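
One possible shape for such a wrapper is sketched below. The four abstract methods mirror the list above. The class and method names, and the dict-based `ToyBackend`, are assumptions made for illustration. A real backend would wrap the tree objects of a specific parser generator instead.

```python
from abc import ABC, abstractmethod

class ParserBackend(ABC):
    """Hypothetical interface hiding one parser generator's tree format."""

    @abstractmethod
    def parse(self, text):
        """Parse a string and return the backend's root node."""

    @abstractmethod
    def get(self, node, name):
        """Return the named part of a node."""

    @abstractmethod
    def items(self, node):
        """Iterate over the repeated children of a node."""

    @abstractmethod
    def text(self, node):
        """Rebuild the source text of a subtree by walking it recursively."""


class ToyBackend(ParserBackend):
    """Stand-in backend: parses 'name=a,b,c' into a dict holding a list."""

    def parse(self, text):
        name, _, values = text.partition("=")
        return {"name": name, "values": values.split(",")}

    def get(self, node, name):
        return node[name]

    def items(self, node):
        return iter(node)

    def text(self, node):
        if isinstance(node, str):
            return node
        if isinstance(node, list):
            return ",".join(self.text(child) for child in node)
        return self.text(node["name"]) + "=" + self.text(node["values"])


def describe(backend, source):
    """Application code: uses only the interface, never the raw tree."""
    root = backend.parse(source)
    name = backend.text(backend.get(root, "name"))
    values = [backend.text(v) for v in backend.items(backend.get(root, "values"))]
    return name, values

print(describe(ToyBackend(), "deps=x,y"))
```

Because `describe` touches the tree only through the interface, adding support for another parser generator means writing another subclass and leaving the application code alone.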

11. Downgrade the grammar to LR(1) (I used the same parglare in LR mode), resolve all the compilation issues, then debug and improve it again. Create or adapt the parse tree postprocessor backends. Make the library work with each parser backend. You will need to repeat this procedure when adding support for each parser generator.

12. Port the grammar to PEG. I used waxeye. It didn't have any debug tools, but it is an interpreter, and a pretty simple one, so I embedded some tracing code into its Python runtime. Debug, synchronize, fix the code, test.
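
Tracing code of this kind can be quite small. The sketch below does not reproduce waxeye's runtime. It assumes a hand-written PEG-style matcher in which every rule is a function `(text, pos) -> new position or None`, and wraps such functions in a decorator that records entry, exit and nesting depth. All names are hypothetical.

```python
import functools

TRACE = []              # collected trace lines; a real tool might print them
STATE = {"depth": 0}    # current nesting depth of rule calls

def traced(rule):
    """Wrap a rule function (text, pos) -> new pos or None with tracing."""
    @functools.wraps(rule)
    def wrapper(text, pos):
        indent = "  " * STATE["depth"]
        TRACE.append(f"{indent}> {rule.__name__} at {pos}")
        STATE["depth"] += 1
        try:
            result = rule(text, pos)
        finally:
            STATE["depth"] -= 1
        outcome = "fail" if result is None else f"ok -> {result}"
        TRACE.append(f"{indent}< {rule.__name__} {outcome}")
        return result
    return wrapper

@traced
def digit(text, pos):
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None

@traced
def number(text, pos):
    # number <- digit+
    end = digit(text, pos)
    if end is None:
        return None
    while True:
        nxt = digit(text, end)
        if nxt is None:
            return end
        end = nxt

number("42", 0)
print("\n".join(TRACE))
```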

13. Downgrade the grammar to LL(*). I used ANTLR; it has a debug tool that let me see the token types and match traces and inspect the parse tree. Here is some advice on porting from LR(1) to LL(*):

A token built by repeating characters must not share characters with another such token, otherwise the two cannot be disambiguated. So all tokens have to be partitioned into disjoint character classes. What once was a single token becomes multiple tokens within a production. For example:

Something: NonDigit+;
Name: AlphaNum+;
NonDigit: ~[0-9];
AlphaNum: [a-zA-Z_0-9-];

must be transformed into

somethingSegment: NonDigitNonAlphaNum | Alpha;
nameSegment: Digit | Alpha;
name: nameSegment+;
something: somethingSegment+;

Alpha: [a-zA-Z_-];
NonDigitNonAlphaNum: ~[0-9a-zA-Z_-];
Digit: [0-9];
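
The partition can be checked mechanically. The sketch below models the character classes from the example as Python sets, restricted to ASCII for simplicity (an assumption: a negated class in a real lexer covers far more than ASCII), and verifies that the new classes are pairwise disjoint and still cover what the old tokens covered. The helper name is made up for this example.

```python
import string

ASCII = {chr(c) for c in range(128)}

# Original, overlapping classes.
NON_DIGIT = ASCII - set(string.digits)                          # ~[0-9]
ALPHA_NUM = set(string.ascii_letters + string.digits + "_-")    # [a-zA-Z_0-9-]

# Partitioned classes from the rewritten grammar.
ALPHA = set(string.ascii_letters + "_-")                        # [a-zA-Z_-]
DIGIT = set(string.digits)                                      # [0-9]
NON_DIGIT_NON_ALPHA_NUM = ASCII - ALPHA - DIGIT                 # ~[0-9a-zA-Z_-]

def pairwise_disjoint(*classes):
    """True if no two of the given character sets share a character."""
    return all(not (a & b)
               for i, a in enumerate(classes)
               for b in classes[i + 1:])

print(pairwise_disjoint(NON_DIGIT, ALPHA_NUM))                   # False
print(pairwise_disjoint(ALPHA, DIGIT, NON_DIGIT_NON_ALPHA_NUM))  # True
# The productions rebuild the old tokens from the new classes:
print(NON_DIGIT_NON_ALPHA_NUM | ALPHA == NON_DIGIT)              # True
print(DIGIT | ALPHA == ALPHA_NUM)                                # True
```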

Make sure that all the backends work as intended, and fix them and the grammars until everything works.
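
One way to check that the backends agree is to run every sample through each of them and compare the normalized results against a reference backend. This is a sketch under the assumption that each backend is exposed as a callable returning a comparable value. The two stand-in backends below are written for the example, and one of them deliberately differs on an edge case.

```python
def compare_backends(backends, samples):
    """Return (sample, backend name) pairs that differ from the first backend."""
    names = list(backends)
    reference = names[0]
    mismatches = []
    for sample in samples:
        expected = backends[reference](sample)
        for name in names[1:]:
            if backends[name](sample) != expected:
                mismatches.append((sample, name))
    return mismatches

# Stand-in backends: both should turn "a,b" into ["a", "b"].
def backend_split(text):
    return text.split(",")

def backend_scan(text):
    parts, current = [], ""
    for ch in text:
        if ch == ",":
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:  # deliberate difference: drops a trailing empty item
        parts.append(current)
    return parts

BACKENDS = {"split": backend_split, "scan": backend_scan}
print(compare_backends(BACKENDS, ["a,b", "a", "a,"]))
```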

16. Downgrade the grammar to LL(1). This may require resolving further conflicts detected at compile time in the same way (some conflicts that show up here at compile time are not conflicts in LL(*)). I used pyCoCo, a variant of CoCo/R. It has very weird error messages and no debugging tools at all. The only reason I was able to make this grammar work on CoCo/R is its similarity to other parser generators, so adapting this grammar to them resulted in adapting this grammar to CoCo/R as well.
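
When the generator's own messages are hard to read, a rough independent check can help locate LL(1) conflicts. The sketch below is a simplified illustration, not what pyCoCo does internally. It assumes a grammar with no empty productions, represented as a dict of alternatives, and reports only FIRST/FIRST conflicts, that is, two alternatives of one rule that can start with the same terminal. The grammars are hypothetical.

```python
def first_sets(grammar):
    """FIRST sets for a grammar without empty productions.

    grammar: {nonterminal: [[symbol, ...], ...]}; any symbol that is not a
    key of the dict is treated as a terminal.
    """
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                head = alt[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def ll1_conflicts(grammar):
    """(nonterminal, terminal) pairs where two alternatives start alike."""
    first = first_sets(grammar)
    conflicts = []
    for nt, alternatives in grammar.items():
        seen = set()
        for alt in alternatives:
            head = alt[0]
            starts = first[head] if head in grammar else {head}
            for terminal in sorted(starts & seen):
                conflicts.append((nt, terminal))
            seen |= starts
    return conflicts

CONFLICTING = {
    "value": [["name"], ["call"]],
    "name": [["Ident"]],
    "call": [["Ident", "(", ")"]],
}
CLEAN = {
    "value": [["name"], ["number"]],
    "name": [["Ident"]],
    "number": [["Digit"]],
}
print(ll1_conflicts(CONFLICTING))
print(ll1_conflicts(CLEAN))
```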

17. Optimize, refactor, clean up, and test everything.

18. Add the grammar to the repositories of grammars so other people can use it.

This article is issued from Wikibooks. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.