Tuesday, February 24, 2009

Pydev, JavaCC and error recovery

Until now, Pydev had a very primitive way of recovering from syntax errors: it changed the document text to make it 'right', using some really simple text-based heuristics that were effective for a few common cases. That is still suboptimal in many cases, though, and recovering from errors while parsing is a better way of dealing with them.

With that in mind, error recovery is being added to the Pydev JavaCC grammar. I must say that one of the biggest obstacles in adding it is the lack of documentation on the subject and on the most effective way to deal with it.

Also, JavaCC has no support for backtracking while parsing (it can look ahead to decide which path to take, but once a path is taken, it cannot go back and change it), and if at some point it cannot find a path to take, or if some token in the path is not found, a ParseException is raised.
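To make the "look ahead, then commit" behavior concrete, here is a minimal hand-written sketch in plain Java. It is not generated by JavaCC and is not Pydev code: the class CommittedChoiceSketch, its tiny grammar and its own ParseException are all hypothetical, made up only to illustrate the idea.

```java
import java.util.List;

// Hypothetical sketch; JavaCC generates its own parser and ParseException.
class CommittedChoiceSketch {
    static class ParseException extends Exception {
        ParseException(String message) { super(message); }
    }

    // Grammar assumed for illustration: atom := "(" NAME ")" | NAME
    static String parseAtom(List<String> tokens) throws ParseException {
        // Lookahead: peek at the first token to choose a branch.
        if (!tokens.isEmpty() && tokens.get(0).equals("(")) {
            // Committed to the parenthesized branch; there is no way back.
            String name = expectName(tokens, 1);
            expect(tokens, 2, ")");
            return name;
        }
        return expectName(tokens, 0);
    }

    private static String expectName(List<String> tokens, int pos) throws ParseException {
        if (pos < tokens.size() && tokens.get(pos).matches("[A-Za-z_]\\w*")) {
            return tokens.get(pos);
        }
        throw new ParseException("Expected a name at token index " + pos);
    }

    private static void expect(List<String> tokens, int pos, String image) throws ParseException {
        if (pos >= tokens.size() || !tokens.get(pos).equals(image)) {
            throw new ParseException("Expected '" + image + "' at token index " + pos);
        }
    }
}
```

Once the opening parenthesis is seen, the sketch never reconsiders the other alternative: a missing closing parenthesis simply ends in an exception, which is the situation error recovery has to deal with.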

The docs I found on handling errors suggest that the correct way of treating ParseExceptions is to catch them and try to revert to a stable state in the parser, but I found another interesting article on the subject, Adding Automatic Syntax Error Repair to a Java-based Parser Generator (by Pieter van der Spek, Nico Plat and Kees Pronk), which explains how auto-recovery could be added to JavaCC.

In the end, looking at the Pydev grammar, I found one interesting property: as part of the effort to support pretty-printing of the Pydev AST, lots of tokens were already being looked for and added as 'special tokens' to the nodes created.

I.e., it would look for a colon token and add it as a special token to the node before it, and if the token was not found, an error would be thrown at that point.

Now, I ended up extending that approach so that if it looks for a token and doesn't find it, the error is reported but not thrown as an exception. Instead, the parser creates the missing token so that it is consumed right after the current token.

The only other gotcha is that the grammar skips new lines, and that happens 'under the hood' as far as the grammar is concerned (this is a really tricky area). So, in the end, the grammar had to be changed to handle the suite() construct even if some indentation 'disappeared'. It only takes that path if there is a problem in the syntax, so that is reported as an error.

In the end I mixed both approaches: one part makes the 'preemptive' attempt to create a token when we know it should be there and it's not, and another handles ParseExceptions to try to recover from them.

The preemptive constructs look like:

    {this.findTokenAndAdd(":");} <COLON>

where findTokenAndAdd reports the error and creates the token if it's not found.
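As a rough illustration of what such a preemptive helper might do, here is a minimal sketch in plain Java. It is not the Pydev implementation: RecoveringTokenStream, its list-of-strings token model and the consume method are hypothetical, and only the name findTokenAndAdd comes from the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token stream; not the Pydev or JavaCC classes.
class RecoveringTokenStream {
    private final List<String> tokens;
    private int pos = 0; // index of the next token to be consumed
    final List<String> errors = new ArrayList<>();

    RecoveringTokenStream(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    // If the next token is not the expected one, report the error (without
    // throwing) and insert a synthetic token right after the current one.
    void findTokenAndAdd(String expected) {
        boolean found = pos < tokens.size() && tokens.get(pos).equals(expected);
        if (!found) {
            errors.add("Expected '" + expected + "' at token index " + pos);
            tokens.add(pos, expected);
        }
    }

    // Consumes the next token, which must match (plays the role of <COLON>).
    String consume(String expected) {
        if (pos >= tokens.size() || !tokens.get(pos).equals(expected)) {
            throw new IllegalStateException("Expected '" + expected + "'");
        }
        return tokens.get(pos++);
    }
}
```

With the tokens if, x, pass (the colon is missing), calling findTokenAndAdd(":") before consume(":") records one error and lets parsing continue as if the colon had been typed.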

And the others work around the ParseExceptions:

E.g.:

Token Name() #Name:{Token t;}
{
    try{
        t =
    }catch(ParseException e){
        t = handleErrorInName(e);
    }
    { ((Name)jjtThis).id = t.image; return t; } {}
}
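The same catch-and-substitute pattern can be sketched outside of JavaCC in plain Java. Everything below is an assumption made for illustration: the Token and ParseException classes are simplified stand-ins for the ones JavaCC generates, the "NAME" kind and the "<missing>" placeholder image are made up, and the body of handleErrorInName is only a guess at what such a handler could do (record the error and return a placeholder token).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal types; JavaCC generates its own Token and ParseException.
class NameRecoverySketch {
    static class Token {
        final String kind;
        final String image;
        Token(String kind, String image) { this.kind = kind; this.image = image; }
    }

    static class ParseException extends Exception {
        ParseException(String message) { super(message); }
    }

    private final List<Token> tokens;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    NameRecoverySketch(List<Token> tokens) { this.tokens = tokens; }

    // Consumes the next token if it has the given kind, else throws.
    private Token expect(String kind) throws ParseException {
        if (pos < tokens.size() && tokens.get(pos).kind.equals(kind)) {
            return tokens.get(pos++);
        }
        throw new ParseException("Expected " + kind + " at token index " + pos);
    }

    // Assumed behavior: record the error and hand back a placeholder token.
    private Token handleErrorInName(ParseException e) {
        errors.add(e.getMessage());
        return new Token("NAME", "<missing>");
    }

    // Mirrors the shape of the Name() production: try, then recover.
    Token name() {
        Token t;
        try {
            t = expect("NAME");
        } catch (ParseException e) {
            t = handleErrorInName(e);
        }
        return t;
    }
}
```

The caller always gets a token back, so the node can still be built, while the errors list keeps track of what went wrong.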

1 comment:

Tom Copeland said...

Yeah, error recovery in JavaCC is tricky. Tokenizer errors are pretty much a disaster, while parser and tree builder errors are somewhat recoverable. And of course you want to find as many errors as possible in one pass when you're checking a data set.