Using ANTLR and PerlXS to Generate a Parser


As I mentioned earlier, we're anticipating changing out the current Parse::RecDescent-based parser in the Kynetx platform with one that will perform better. We've been going down the path of using ANTLR, a modern parser generator that supports multiple target languages. That flexibility was one of the key things that got us interested in ANTLR. We might want to generate Ruby or JavaScript KRL generators at some point.

But of course right now we want to generate a Perl parser, since that's what the underlying Kynetx Event Service (KES) is written in (it's an Apache module). ANTLR doesn't support Perl. That's probably just as well, however, since we're after as much speed as possible. We could generate Java (the target best supported by ANTLR), but adding Java servers into the current operational mix doesn't excite me.

The obvious course is to generate C and then use PerlXS to integrate the resulting parser into the Perl-based KES engine. To explore the feasibility of that, I decided to play around with ANTLR-generated C parsers and PerlXS to see how they'd work together. What follows is an interim report of what I found.

I started with the SimpleCalc example that is part of the five minute introduction to ANTLR. The grammar file is almost unchanged from that example:

grammar SimpleCalc;

options
{
    language=C;
    output=AST;
    ASTLabelType=pANTLR3_BASE_TREE;

}

tokens {
    PLUS    = '+' ;
    MINUS   = '-' ;
    MULT    = '*' ;
    DIV     = '/' ;
}


/* PARSER RULES */
expr    : term ( ( PLUS | MINUS )^  term )*;
term    : factor ( ( MULT | DIV )^ factor )* ;
factor  : NUMBER ;


/* LEXER RULES */
NUMBER  : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+
     { $channel = HIDDEN; } ;
fragment DIGIT  : '0'..'9' ;

The only difference is that I've told it to generate an AST and annotated the grammar (with ^) to tell it which tokens are tree nodes.
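
To make the effect of ^ concrete, here's a minimal hand-written sketch in C that mirrors the three parser rules and promotes each operator to the root of the subtree built so far. It is not what ANTLR generates and it doesn't use the ANTLR runtime; the Node type and the function names (show_tree and friends) are made up for illustration, and there is no error handling.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type for this sketch; not ANTLR's pANTLR3_BASE_TREE. */
typedef struct Node {
    char         text[16];       /* operator symbol or number literal */
    struct Node *left, *right;   /* both NULL for a NUMBER leaf */
} Node;

static Node *new_node(const char *text, size_t len, Node *left, Node *right)
{
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) abort();
    if (len >= sizeof n->text) len = sizeof n->text - 1;  /* truncate long numbers */
    memcpy(n->text, text, len);
    n->left = left;
    n->right = right;
    return n;
}

/* Stands in for the hidden WHITESPACE channel. */
static void skip_ws(const char **p)
{
    while (isspace((unsigned char)**p)) (*p)++;
}

/* factor : NUMBER ;   (a missing number simply yields an empty leaf) */
static Node *factor(const char **p)
{
    const char *start;
    skip_ws(p);
    start = *p;
    while (isdigit((unsigned char)**p)) (*p)++;
    return new_node(start, (size_t)(*p - start), NULL, NULL);
}

/* term : factor ( ( MULT | DIV )^ factor )* ; */
static Node *term(const char **p)
{
    Node *root = factor(p);
    for (skip_ws(p); **p == '*' || **p == '/'; skip_ws(p)) {
        const char *op = (*p)++;
        root = new_node(op, 1, root, factor(p));  /* operator becomes the new root */
    }
    return root;
}

/* expr : term ( ( PLUS | MINUS )^ term )* ; */
static Node *expr(const char **p)
{
    Node *root = term(p);
    for (skip_ws(p); **p == '+' || **p == '-'; skip_ws(p)) {
        const char *op = (*p)++;
        root = new_node(op, 1, root, term(p));
    }
    return root;
}

/* Append s to buf without writing past cap bytes. */
static void append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf);
    snprintf(buf + used, cap - used, "%s", s);
}

/* Prefix form: leaves print as-is, operators as "(op left right)". */
static void to_string_tree(const Node *n, char *buf, size_t cap)
{
    if (n->left == NULL) { append(buf, cap, n->text); return; }
    append(buf, cap, "(");
    append(buf, cap, n->text);
    append(buf, cap, " ");
    to_string_tree(n->left, buf, cap);
    append(buf, cap, " ");
    to_string_tree(n->right, buf, cap);
    append(buf, cap, ")");
}

static void free_tree(Node *n)
{
    if (n == NULL) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/* Parse src and write the prefix form of its tree into buf (cap bytes). */
void show_tree(const char *src, char *buf, size_t cap)
{
    Node *tree = expr(&src);
    buf[0] = '\0';
    to_string_tree(tree, buf, cap);
    free_tree(tree);
}
```

Calling show_tree("3 + 4 * 5", buf, sizeof buf) with a 64-byte buffer leaves "(+ 3 (* 4 5))" in buf: the multiplication sits lower in the tree because term is nested inside expr, and the loop in each rule makes the operators left-associative.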

I used h2xs to generate the boilerplate xs files:

h2xs -A -n SC

This creates a directory called SC and a bunch of files for PerlXS. That's where I put all the generated files from ANTLR. If you look at the C version in the ANTLR introduction, you'll see an @members declaration that contains some C code that exercises the parser. That's what I modified to put into the PerlXS file:

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "SimpleCalcLexer.h"
#include "SimpleCalcParser.h"

#include "ppport.h"

MODULE = SC        PACKAGE = SC

char *
showtree(in)
         char * in
    CODE:

    pANTLR3_INPUT_STREAM           input;
    pSimpleCalcLexer               lex;
    pANTLR3_COMMON_TOKEN_STREAM    tokens;
    pSimpleCalcParser              parser;
    SimpleCalcParser_expr_return     langAST;

    char * output;

    input = antlr3NewAsciiStringInPlaceStream
            ((pANTLR3_UINT8) in,
            (ANTLR3_UINT32) strlen(in),
            NULL);
    lex    = SimpleCalcLexerNew(input);
    tokens = antlr3CommonTokenStreamSourceNew
               (ANTLR3_SIZE_HINT,
                TOKENSOURCE(lex));
    parser = SimpleCalcParserNew(tokens);

    langAST = parser->expr(parser);

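    /* Copy the string into Perl-managed memory before cleanup, since
       the ANTLR-owned buffer may not survive it. */
    output = savepv((char *) langAST.tree->toStringTree(langAST.tree)->chars);

    // Must manually clean up
    //
    parser ->free(parser);
    tokens ->free(tokens);
    lex    ->free(lex);
    input  ->close(input);

    RETVAL = output;

    OUTPUT:
      RETVAL

    CLEANUP:
      Safefree(RETVAL);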

This defines a function called showtree that will be called from Perl. The file also includes the .h files that ANTLR generated and uses the input string in place instead of reading a file as the example did. The return value (denoted by the special identifier RETVAL) is just a string representation of the parse tree.

The Makefile.PL file for PerlXS is pretty standard:

use 5.010000;
use ExtUtils::MakeMaker;
WriteMakefile(
    NAME              => 'SC',
    VERSION_FROM      => 'lib/SC.pm', 
    PREREQ_PM         => {}, 
    ($] >= 5.005 ?     
      (ABSTRACT_FROM  => 'lib/SC.pm', 
       AUTHOR         => 'Web-san ') : ()),
    LIBS              => ['-lantlr3c'], 
    DEFINE            => '', 
    INC               => '-I.', 
    # link all the C files too
    OBJECT            => '$(O_FILES)',
);

You'll notice I link the ANTLR library in here. Since the xs file references a string output, I created a typemap file to map that for PerlXS:

TYPEMAP
char * T_PV

Now, we compile the xs files in the standard way:

perl Makefile.PL
make

The Perl file is pretty simple as well:

#!/usr/bin/perl -w

use ExtUtils::testlib;   # adds blib/* directories to @INC
use SC;
print SC::showtree("3 + 4 * 5"), "\n";

Executing this program prints a prefix (LISP-style) representation of the arithmetic expression passed into the showtree function. For the input above, that should be (+ 3 (* 4 5)).

Of course, this isn't what we want for our system. We want a full-fledged AST back that we can manipulate in Perl. I spent a little time on typemaps and have reached the conclusion that the right method is to use an ANTLR-generated tree parser (a parser for the AST) to walk the tree and build one that is more like what we are used to in the KES engine, and then use a typemap to turn that into Perl.
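
As a rough picture of what that walk might look like, here's a small standalone C sketch that visits each node of an expression tree and emits a nested Perl-style structure. Everything in it is an assumption made for illustration: the Node type is a stand-in rather than ANTLR's tree type, the { op => ..., args => [...] } shape is invented (the real KES AST isn't shown here), and the result is rendered as Perl source text instead of being built through the Perl API so the sketch stays self-contained.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an AST node; not ANTLR's pANTLR3_BASE_TREE. */
typedef struct Node {
    const char        *text;           /* operator symbol or number literal */
    const struct Node *left, *right;   /* both NULL for a number leaf */
} Node;

/* Append s to buf without writing past cap bytes. */
static void append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf);
    snprintf(buf + used, cap - used, "%s", s);
}

/* Walk the tree depth-first. Leaves become bare scalars; operator nodes
 * become a hash with the operator and an array of converted children.
 * buf must hold an empty string on the first call. */
void to_perl(const Node *n, char *buf, size_t cap)
{
    if (n->left == NULL && n->right == NULL) {
        append(buf, cap, n->text);
        return;
    }
    append(buf, cap, "{ op => '");
    append(buf, cap, n->text);
    append(buf, cap, "', args => [ ");
    to_perl(n->left, buf, cap);
    append(buf, cap, ", ");
    to_perl(n->right, buf, cap);
    append(buf, cap, " ] }");
}
```

In a real XS module the same recursion would presumably create hashes and arrays directly rather than text, but the shape of the walk would be the same: one visit per node, with children converted before their parent is finished.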

So, it would appear that using ANTLR to generate a C-based parser and then using PerlXS to wrap that for use in Perl is feasible. As we figure out the AST output, I'll write more.