Parsing with Perl


The system we're building at Kynetx includes a domain-specific language that uses rules to create JavaScript programs that get sent down to the browser. I've documented our decision to use a domain-specific language and our choice of Perl in other posts.

When I started this project, I was reading Mark Dominus' book Higher Order Perl and started playing around with his HOP parser. One thing led to another, and before you know it I had a full-blown language parser in HOP without giving much thought as to whether or not I'd made the right choice.

I found the HOP parser to be pretty flexible, but it has its quirks. More importantly, I didn't like the BNF specification format, so I kept the language spec separately and was constantly trying to keep the spec and the implementation in sync. Better if I could just use the spec as the implementation, à la Bison. Don't get me wrong, this is a great book with lots of wonderful ideas, but I wanted something else for the parser.

As I added more and more features to the language, it got to where I'd dread making the parser changes. Recently, I decided I had to significantly beef up the predicate expressions and thought it would be a good time to change out the parser as well.

A few months ago I picked up Christopher Frenz's Pro Perl Parsing in anticipation of just this day. Reading through it clarified my choices, and ultimately I picked Damian Conway's Parse::RecDescent, a recursive descent parser, over the other contender, Parse::Yapp. The reasons for my choice were partly esthetic and partly trust in Damian. The main thing I was after was a parse spec that I could both read and compile, and RecDescent seemed great in that regard.

The biggest downside of RecDescent is that there's no associated lexer. For most things that's not a big deal, since terminals can be specified as regular expressions. The place where it really bit me was comments. Removing comments is trickier than you'd think because a comment marker that appears inside a quoted string must not be treated as the start of a comment. With a lexer, that's easy; without one, it's more problematic. Writing the regexp to remove comments took me a while to get right. I ended up using a modified version of the solution given in this FAQ. The problem with most solutions, including Regexp::Common, which has a language comment module, is that they don't account for comment markers in quotes.
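
A minimal sketch of the general technique, assuming C-style // and /* */ comment markers (the language's real comment syntax isn't shown here) and a hypothetical helper named strip_comments. The idea is to match quoted strings and comments in a single alternation, putting strings first so that a marker inside quotes is consumed as part of the string and never seen as a comment:

```perl
use strict;
use warnings;

# Hypothetical helper: removes // line comments and /* ... */ block comments
# while leaving single- and double-quoted strings untouched.
sub strip_comments {
    my ($src) = @_;
    $src =~ s{
        (                              # $1: a quoted string, kept as-is
            " (?: \\. | [^"\\] )* "    #   double-quoted, with backslash escapes
          | ' (?: \\. | [^'\\] )* '    #   single-quoted, with backslash escapes
        )
      | /\* .*? \*/                    # block comment, dropped
      | // [^\n]*                      # line comment, dropped up to end of line
    }{ defined $1 ? $1 : '' }gsex;
    return $src;
}

# Sample input (made up for illustration): the // inside the URL and the
# /* */ inside the single-quoted string are not comments.
my $sample = qq{url = "http://example.com"; // fetch it\n}
           . qq{msg = 'a /* b */ c'; /* done */\n};

print strip_comments($sample);
```

This sketch doesn't handle here-documents or unterminated strings, both of which would need their own alternatives in the pattern.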

All in all, rewriting the parser was a good exercise and I'm happy with the choice of RecDescent. Here's a sample production from my file:

decl: VAR '=' VAR ':' VAR '(' expr(s? /,/) ')'
      {$return =
       {'lhs' => $item[1],
        'type' => 'data_source',
        'source' => $item[3],
        'function' => $item[5],
        'args' => $item[7]
       }
      }
    | VAR '=' 'counter' '.' VAR
      {$return =
       {'lhs' => $item[1],
        'type' => $item[3],
        'name' => $item[5]
       }
      }
    | VAR '=' HTML
      {$return =
       {'lhs' => $item[1],
        'type' => 'here_doc',
        'value' => $item[3]
       }
      }
    | 

This production for decl has three alternates. Each has a separate return value (a hash) that represents the portion of the abstract syntax tree created for that part of the input.
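
To see how such a return value comes back to the caller, here is a small sketch that loads only the counter alternative into Parse::RecDescent and parses one line. The VAR terminal and the sample input are assumptions made for illustration; the real grammar's definition of VAR isn't shown in this post.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Cut-down grammar: just the counter alternative of decl.
# The VAR rule is an assumed definition, not the one from the real grammar.
my $grammar = q{
    decl: VAR '=' 'counter' '.' VAR
          { $return = { 'lhs'  => $item[1],
                        'type' => $item[3],
                        'name' => $item[5] } }

    VAR: /[A-Za-z_]\w*/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Each rule becomes a method on the parser object; it returns the value of
# $return on success and undef if the text doesn't match.
my $ast = $parser->decl('visits = counter.page_views');
die "Parse failed\n" unless defined $ast;

print "$_ => $ast->{$_}\n" for sort keys %$ast;
```

Because $item[3] is the matched literal 'counter', the type field in this alternative comes straight from the input rather than being hard-coded as in the other two.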

If you decide to give Parse::RecDescent a try, here are some resources:

Reading the documentation and the FAQ thoroughly is highly recommended. There are lots of little tricks that can make your job easier.

My job, replacing an existing parser, was made easier by the fact that I'd previously built a pretty thorough test suite in Perl for the parser and some related modules. So once I got the language spec pretty much complete, I started running the tests and correcting errors as they cropped up. In a few hours, I'd solved all the problems and was confident my parser was ready to go. Definitely the way to go.
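
A test suite of this kind can be sketched with Test::More: each case pairs an input with the syntax tree it should produce, so the same cases can be run against the old parser and its replacement. Everything here is hypothetical, including the parse_decl stand-in, which fakes a parser with a regexp so the sketch runs on its own:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for the parser under test. A real suite would call
# the actual parser's entry point here. It only understands "x = counter.y".
sub parse_decl {
    my ($text) = @_;
    return undef
        unless $text =~ /^\s*(\w+)\s*=\s*(counter)\s*\.\s*(\w+)\s*$/;
    return { lhs => $1, type => $2, name => $3 };
}

# Each case: input text, then the syntax tree fragment expected for it.
my @cases = (
    [ 'visits = counter.page_views',
      { lhs => 'visits', type => 'counter', name => 'page_views' } ],
    [ 'n=counter.n',
      { lhs => 'n', type => 'counter', name => 'n' } ],
);

for my $case (@cases) {
    my ($input, $expected) = @$case;
    my $got = parse_decl($input);
    is_deeply($got, $expected, "parses: $input");
}

# Malformed input should be rejected rather than half-parsed.
ok(!defined parse_decl('visits = = counter'), 'rejects malformed input');

done_testing();
```

Since the cases only describe inputs and expected trees, swapping the parser means changing the body of parse_decl and nothing else.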

At any rate, now I've got a shiny new parser that I can go modify. Fun!