For some reason, a PerlMonks post from last year, Perl Cannot Be Parsed, has resurfaced in online discussions.
As you might expect, most comments are ill informed and
blatant in their wrongness.
At least half of the comments are the equivalent of "OH noes i cantn't read
PERL LOLLERS now i is nos Y!!!", but that's what you get for reading Twitter,
Digg, YouTube, and occasionally Reddit. (If no one's invented the phrase "The
Eternal 4channing of the Internet", I claim it.)
Another significant portion of the comments come from people who've read only the headline and feel sufficiently enlightened that they can proclaim their wisdom to the world at large, despite comments and clarifications on that PerlMonks post from people who've actually worked on the Perl 5 internals and understand the implications of that statement and their limitations.
A few comments are from people who know what this means and what it implies and those (few) cases where it applies.
I suspect the rest of the programming world which might care persists in a limited state of confusion. What does this mean? Why does it matter? What happens now? This is especially true for people who use Perl but don't necessarily consider themselves programmers. They don't have computer science backgrounds, but they'd use good tools like Perl::Critic if they knew they were available.
On Static Parsing
The unparsability of Perl 5 is a static unparsability. "Static" is a specific term of art which means, basically, "You can't run any part of the program to determine what it means."
One of the most common points of confusion in the discussion is a misunderstanding of what "static" means. This doesn't mean that it's impossible to parse Perl 5. The files perly.y and perltoke.c in the Perl 5 source tree do just that. However, the existence of subroutine prototypes and bareword declarations combined with the possibility of running arbitrary code during compilation with BEGIN
blocks introduces ambiguity to the process.
That's all the proof addresses.
I realize that's not obvious to everyone reading this so far. That's fine. Just remember that the proof addresses important but limited circumstances.
Why limited? It is possible to parse Perl 5 code statically in many cases -- most cases, even. PPI is an example of this. PPI can build a document object model representing Perl 5 programs sufficient to perform static analysis; this is why Perl::Critic
works.
PPI isn't perfect, and it can't resolve all possible Perl 5 programs to unambiguous object models. Part of that is the malleability of Perl 5 syntax.
Changing the Parse
I mentioned subroutine prototypes and barewords; these are two cases where predeclaration changes how Perl 5 will parse existing code.
Subroutine prototypes can change the expected arity and parse and runtime behavior of subroutines. In general, you should avoid them unless you know exactly what this means and what it implies.
Changing Expectations
In specific circumstances, they're useful to change how Perl 5 passes
arguments to the functions. My favorite prototype lets you pass blocks as
anonymous subs:
sub wrap (&) { ... }
my $wrapped = wrap { do_this(); do_that() };
Without the &
prototype character, you'd have to call wrap()
with more verbosity:
my $wrapped = wrap( sub { ... } );
It's a small prettification, but it's occasionally useful.
There are two important points to note about this example. First, Perl 5
must encounter the sub declaration with its prototype prior to the point at
which it parses the call to wrap
. The prototyped declaration
changes how Perl 5 parses that construct -- otherwise the curly braces
would run through the "is it a hash reference or a block?" heuristic and would
not get promoted to an anonymous subroutine declaration.
Second, a smart and stateful static parser could identify this type of declaration and parse the code the same way the Perl 5 parser does. This is not by itself sufficient to make parsing Perl 5 statically indeterminate.
(Some people might object that statefulness in a static parser violates some
rule of computer theory somewhere, which is a true point but practically
useless; you need at least some statef in your parser to find the end
of a double-quoted string if you allow escaped embedded double quotes. The
question is not if you allow lookahead but how much lookahead you need. Also
parser people, remember that I reserve the right to tell small fibs to explain
things to an audience that won't ever take a class about automata.)
Changing Arity
Arity can be more interesting; this is a fancy programming term which means "how many arguments a function or operator takes". Perl 5 subroutines are usually greedy; they'll happily slurp up all arguments in a big list. Certain subroutine prototypes allow you to change this. For example, you can define a foo
subroutine which only takes two arguments:
sub foo ($$)
{
return "<@_>";
}
say foo 1, 2;
The prototype changes the parsing of foo
calls so that
providing more than two arguments is a syntax error:
say 'Parsing done';
say foo 1, 2, 3;
$ perl protos.pl
Too many arguments for main::foo at protos.pl line 13, near "3;"
Execution of protos.pl aborted due to compilation errors.
Again, Perl 5 must have encountered the prototyped declaration before it can apply the parsing rule changes to any code. Again, a clever and stateful static parser could detect these parser changes and adapt to them.
Barewords
The Perl 5 parser has a complex heuristic for figuring out what a bareword actually means. It could be a filehandle, a class name, a function, or more. Several good Perl 5 programmers recommend against the indirect object invocation syntax (though dative is a description with more technical accuracy for linguists) because it can parse differently depending on code Perl 5 has already executed.
my $obj = new Class foo => 'bar'; # do not mimic this example
Perl 5 has heuristics to determine what new
and Class
mean in this context. It may be obvious to readers that this code really means:
my $obj = Class->new( foo => 'bar' );
... but if the parser has already seen a subroutine declaration for Class
in this particular namespace (in certain circumstances, I believe it's possible to fool the parser with a specific way of declaring new
), the heuristic may choose the wrong option.
The indirect pragma is an effective way to root out this ambiguity from your code. There are some seven cases in which you can run afoul of this bareword heuristic; you can tell who's a real Perl 5 guru by asking him or her to list at least five. See Matt S. Trout's Indirect but Still Fatal for a fuller discussion of the issues here.
A very clever, very careful static parser could still replicate the complex heuristic in the Perl 5 parser and parse code appropriately.
The Real Problem
What's the source of the ambiguous parsing issue?
If Perl 5 requires predeclaration of all parse modifications, and if they're declarative, why is writing a static parser which can parse all Perl 5 programs in the same way that the Perl 5 parser does impossible?
Perl 5's compilation phase parses Perl 5 code into a parse tree. The execution phase traverses this tree. They're two separate stages...
... except that BEGIN
and implicit BEGIN
blocks can run arbitrary code during the compilation phase.
That is, you can write code that will not compile successfully:
sub foo ($$)
{
return "<@_>";
}
say foo 1, 2, 3;
... and insert a BEGIN
block to change the way Perl 5 will
parse subsequent code:
sub foo ($$)
{
return "<@_>";
}
BEGIN
{
no warnings qw( prototype redefine );
eval 'sub foo { return "<@_>" }'
}
say foo 1, 2, 3;
Even though the eval
code wouldn't normally run until after the compilation stage has finished and the execution stage has begin, the BEGIN
block causes the Perl 5 parser to execute its contents immediately after its parsing has finished -- before parsing the very next line.
This is a contrived example -- why would anyone write this code? -- but you
should start to see the problem.
Even though this syntax is declarative to the Perl 5 parser, a static
analysis tool would have to analyze the semantic meaning of code within the
BEGIN
block to determine if it has any effects. That means
peeking into double-quoted strings used as arguments to eval
.
Even if a static parser could do that, it'd be easy to confuse it again:
BEGIN
{
no warnings qw( prototype redefine );
eval 'sub' . ' foo ' . '{ return "<@_>" }'
}
Even though Perl 5 can coalesce those three concatenations to a single constant string during its optimization phase, the parser must get more complex to detect this case.
You can see where this is going.
Ultimately, completely accurate static parsing of Perl 5 programs is
impossible in certain cases because I could write:
BEGIN
{
no warnings qw( prototype redefine );
eval 'sub' . ' foo ' . '{ return "<@_>" }'
if rand( 1 ) > 0.5;
}
The static analysis tool can't even tell if this program would parse
successfully. About half the time, Perl 5 will report that it has a syntax
error. Half of the time it will succeed.
Thus there exist pathological cases written by curious (and potentially
malicious) people such as myself where a static Perl 5 parser cannot predict
what the actual Perl 5 parser will do. If you're morbidly curious, consider
changing the prototype of a symbol already parsed through raw typeglob
access.
The Implications
Don't panic. Static analysis tools can't parse all valid Perl 5 programs, but they can parse many Perl 5 programs. While the Perl 5 parser is complex with many special cases, it's possible to write a clever tool such as PPI
which is good enough for almost all code. Besides that, the language-bending manipulations that could trip up a static analysis tool are mostly declarative and mostly decidable, unless you're doing something beyond weird.
In most cases this doesn't matter. That example code is completely
unreliable. There are ways to disambiguate potentially ambiguous code -- and
good Perl 5 hackers recommend using those approaches anyway. Most languages without a very strict homoiconic representation have parsing ambiguities (and if you allow arbitrary code execution in templates or reader macros, even homoiconicity can't always save you).
Besides that, once you allow loading language extensions from languages with access to raw memory, all bets are off.
With a bit of work, I'm sure I could come up with ways to confuse static Ruby and Python parsers without resorting to C. Implicit variable declarations seems like one perilous approach....