How to Parse Perl 5 on the JVM

I have a simple rule for judging the accuracy of a newspaper, a periodical, a television program, or any other mechanism intended to report events accurately. If I read a report about an event I've attended or a subject which I know in detail and I find factual errors that modest research would have corrected (errors of opinion are fine, though why do even the mothers of Paul Krugman and Thomas Friedman read their nonsense?), I assume that said venue is just as wrong about subjects about which I know even less.

Yes, it's been three months already, and the PERL IZ IMPARSABLES!!!! subject has come up yet again.

This time, a message from late last year lamenting the fizzling out of a project to port Perl 5 to the JVM spurred an awful Reddit headline (but I repeat myself).

Everything I wrote in On Parsing Perl 5 still applies. If you haven't read and understood it, opining about what Perl 5 does and doesn't do and what this does and doesn't apply to contributes only to the endless 4channing of the Internet, and you should stop doing so.

With that said, and with everyone now understanding that only undeclared barewords, modified operand contexts, and unparenthesized arity changes have any bearing on the successful static parsing of arbitrary Perl 5 code, the problem isn't as difficult as the email makes it sound. It's not trivial, and I'm not volunteering to do it, but it's not impossible and it doesn't require reimplementing Perl 5's parser in Java.

The secret is indirection.

Consider the code:

say foo 1, 2, 3;

Without any prototype on foo(), Perl 5 considers it a listary function which slurps up the arguments 1, 2, and 3. Its return value, if any, is the single argument to say.

With a prototype on foo(), the proper interpretation might be to consume only one argument, or only two arguments. At this point some people look at the static parsing possibilities of Perl 5 and throw up their hands. Yet the only place where any Perl 5 implementation really needs to know how many arguments the bare-word foo without parentheses needs to consume is at the point of execution...

... at which point foo() has a prototype, whether that's the implicit "No one has given it a prototype, so it's listary!" default or an explicitly defined prototype. In other words, all a static parser needs to do when it encounters a syntactic situation like this, where the runtime behavior is uncertain, is emit code that can look up foo's prototype when it comes time to evaluate the containing expression.
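Here's a minimal sketch of that indirection in Java. Everything in it is hypothetical: the names DeferredCallSketch, Sub, SYMBOLS, and callBareword are mine, not part of any real Perl-on-JVM project. It also deliberately reduces a prototype to a bare operand count and treats a sub's return value as a single scalar, which glosses over real Perl 5 prototype and context semantics. The point is only that the parser hands over every operand that might belong to foo, and the split happens when the expression runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: defer the arity decision for a bareword call until run time. */
final class DeferredCallSketch {
    /** Sentinel operand count for a sub with no prototype: slurp everything. */
    static final int LISTARY = -1;

    /** A sub as this toy runtime sees it: an operand count plus a body. */
    static final class Sub {
        final int arity;
        final Function<List<Object>, Object> body;

        Sub(int arity, Function<List<Object>, Object> body) {
            this.arity = arity;
            this.body = body;
        }
    }

    /** Stand-in for the symbol table; subs may be defined or redefined at any time. */
    static final Map<String, Sub> SYMBOLS = new HashMap<>();

    /**
     * What a static parser might emit for `foo 1, 2, 3` inside a list: the
     * bareword's name plus every operand that could belong to it. Returns the
     * list that the enclosing operator (say, in the example) finally receives.
     */
    static List<Object> callBareword(String name, List<Object> operands) {
        Sub sub = SYMBOLS.get(name);
        if (sub == null) {
            throw new IllegalStateException("Undefined subroutine &main::" + name);
        }

        // The decision the parser could not make statically happens here.
        int take = sub.arity == LISTARY
                ? operands.size()
                : Math.min(sub.arity, operands.size());

        List<Object> result = new ArrayList<>();
        result.add(sub.body.apply(operands.subList(0, take)));
        result.addAll(operands.subList(take, operands.size()));
        return result;
    }
}
```

If SYMBOLS holds a LISTARY foo, callBareword("foo", ...) passes all three operands to foo and say gets one value back. If foo is registered with an operand count of 1, foo sees only the 1, and the 2 and 3 pass through to say untouched. The emitted code is the same in both cases.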

(A polymorphic inline cache could even make such code efficient, unless you want to allow code rewriting... but optimization takes place after the proof of concept works.)
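To make the caching idea concrete, here's a deliberately simplified sketch in Java. It is a single-entry call-site cache invalidated by a generation counter, not a full polymorphic inline cache, and every name in it (ArityCacheSketch, SymbolTable, CallSite, slowLookup) is hypothetical. As a further assumption, a prototype is again reduced to a bare operand count, with -1 standing in for "listary".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a single-entry call-site cache for prototype lookups. */
final class ArityCacheSketch {
    /** Stand-in symbol table mapping sub names to operand counts (-1 = listary). */
    static final class SymbolTable {
        private final Map<String, Integer> arities = new HashMap<>();
        private int generation = 0; // bumped whenever any sub is defined or redefined
        int lookups = 0;            // counts slow-path lookups, for illustration only

        void define(String name, int arity) {
            arities.put(name, arity);
            generation++;
        }

        int slowLookup(String name) {
            lookups++;
            Integer arity = arities.get(name);
            if (arity == null) {
                throw new IllegalStateException("Undefined subroutine &main::" + name);
            }
            return arity;
        }

        int generation() {
            return generation;
        }
    }

    /** One instance per bareword call site in the emitted code. */
    static final class CallSite {
        private final String name;
        private int cachedArity;
        private int cachedGeneration = -1; // -1 means nothing is cached yet

        CallSite(String name) {
            this.name = name;
        }

        /** Fast path when nothing has been redefined; slow path otherwise. */
        int arity(SymbolTable table) {
            if (cachedGeneration != table.generation()) {
                cachedArity = table.slowLookup(name);
                cachedGeneration = table.generation();
            }
            return cachedArity;
        }
    }
}
```

A call site pays for the symbol table lookup once and then reuses the answer until some sub is defined or redefined, at which point the next call re-resolves. A single global generation counter is crude, since redefining any sub invalidates every call site, but it keeps the sketch short.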

Your syntax highlighter still might get this wrong, but executing Perl 5 code correctly in this case is still possible even if you don't evaluate code during compilation. One caveat to this approach is that certain compile-time errors become difficult to report (the strict pragma's distaste for undeclared barewords, for example), though there are ways around that as well.

Even so, as the mailing list message suggested, eliminating (euphemistically, discouraging) certain Perl 5 syntactic constructs would make parsing Perl 5 much more effective. I'm not sure how to handle source filters, for example.

With all of this said, the goal of not writing a Perl 5 parser for Perl 5 on the JVM is still troublesome, because any reasonably complete Perl 5 implementation still has to support string eval.

1 Comment

nilsonsfj.myopenid.com | May 28, 2010 3:12 PM

Groovy has string eval and it is, AFAIK, exclusively a JVM language.
It also features a supposedly awesome support for writing DSLs, although I've only used it and never written one myself.

So, in theory, it should be possible to have all these features. It might not be blazing fast, but it could be fast enough.
