In the discussions which prompted me to write On Parsing Perl 5, I've read many misconceptions about how Perl 5 works.
The strangest example is a comment on Lambda the Ultimate which contains an incorrect suggestion that Perl 5 subroutines take the source code of the program as an argument to resolve ambiguous parsing.
Someone elsewhere gave the example that Perl gurus preface answers to the question "Is Perl 5 interpreted or compiled?" with "It depends." (Part of the reason for that is that Larry himself often prefaces his answers to all sorts of questions with "It depends.")
Perl 5's execution model isn't quite the same as a traditional compiler (whatever that means) and it's definitely not the same as the traditional notion of an interpreter. There are two distinct phases of execution of a Perl 5 program: compile time and runtime. You must understand the difference to take full advantage of Perl 5.
Compile Time
The compilation stage of a Perl 5 program resembles the compilation stage as you may have learned it in a compilers class. A lexer analyses source code, producing individual tokens. A parser analyses patterns of tokens to build a tree structure representing the operations of the program and to produce syntax errors and warnings. An optimizer prunes and rebuilds the tree for efficiency and consistency.
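One way to see the optimizer at work is to ask Perl to turn a compiled subroutine back into source text. The following is a minimal sketch, assuming the core B::Deparse module is available; the variable names are only for illustration. The constant expression is folded at compile time, so the multiplication no longer appears in the regenerated code.

```perl
use strict;
use warnings;
use B::Deparse ();

# The parser sees 60 * 60 * 24; the optimizer folds it to a single constant.
my $seconds_per_day = sub { return 60 * 60 * 24 };

# coderef2text regenerates Perl source from the compiled optree.
my $regenerated = B::Deparse->new->coderef2text($seconds_per_day);

print $regenerated, "\n";
```

The regenerated text contains the literal 86400 rather than the original expression, because the tree that Perl 5 keeps in memory has already been rewritten.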
Unlike the compilation model you may expect from a language implementation which produces a serialized compilation artifact (think of a C compiler producing a .o file, for example, or javac emitting a .class file), Perl 5 stores this data structure in memory only. That's one way in which Perl 5 differs from other language implementations; it manages the artifacts of compilation itself.
Certain operations happen only at compilation time -- looking up function names where possible, binding lexical (my) variables to lexical pads, entering global symbols into symbol tables. One common error which confuses novices is not realizing that my declarations have compile time effects while assignments have runtime effects.
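The difference between declaration and assignment is easy to demonstrate. This is a small sketch with made-up variable names: the BEGIN block runs while the file is still compiling, after the my declaration has taken effect but before the assignment has run.

```perl
use strict;
use warnings;

my $seen_at_compile_time;
my $greeting = 'Hello, world!';

BEGIN {
    # $greeting is already declared, so this compiles under strict,
    # but the assignment above has not executed yet.
    $seen_at_compile_time = defined $greeting ? 'defined' : 'undefined';
}

print "At compile time, \$greeting was $seen_at_compile_time\n";
print "At runtime, \$greeting is '$greeting'\n";
```

The variable exists as soon as the parser has seen its declaration, but it holds no value until runtime.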
Runtime
After Perl 5 has produced its tree -- the optree -- it begins executing the program by traversing the optree in execution order. Although the optree has a tree shape for ease of representing operations, execution does not start at the root of the tree and proceed leafward. By this point, the source code is gone.
Of course, certain runtime operations such as the eval STRING operator or require can begin a new, limited compilation time -- but they have no effect on source code already parsed into the optree. This is important.
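A short sketch can make the nesting of phases visible. The @events array and the labels are invented for this example; the point is only the order in which the entries appear.

```perl
use strict;
use warnings;

our @events;

BEGIN { push @events, 'main: compile time' }
push @events, 'main: runtime';

# eval STRING starts a new, limited compile time in the middle of runtime.
my $answer = eval q{
    BEGIN { push @events, 'eval: compile time' }
    push @events, 'eval: runtime';
    6 * 7;
};
die $@ if $@;

print "$_\n" for @events;
```

The string passed to eval gets its own compile time and its own runtime, both of which happen during the runtime of the enclosing program.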
Executing Code During Compile Time
One of the difficulties in parsing Perl 5 code statically is that one Perl 5 linguistic construct executes code during compile time. The BEGIN block executes as soon as Perl 5 has successfully parsed it. (See perldoc perlmod for more information.)
Because BEGIN temporarily suspends compilation, it can manipulate the environment used by the parser to affect how the parser will treat subsequent source code. An easy example is the case of importing symbols from an external module:
use strict;
The Perl 5 parser treats use statements as if you'd written something like:
BEGIN
{
    require 'strict.pm';
    strict->import();
}
As soon as the parser reaches the semicolon, it executes this code. This causes perl to try to load strict.pm, compile it, and then call its import() method. Within that method, the strict module modifies lexically scoped hints, some of which cause the parser to require declarations of variables and barewords.
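Because those hints are lexically scoped, the same snippet of source can compile in one block and fail in another. The following sketch uses eval STRING to compile two nearly identical snippets under different hints; the variable names are hypothetical.

```perl
use strict;

my ($loose_ok, $strict_ok, $strict_error);

{
    # Runs at compile time and relaxes the hint for this block only.
    no strict 'vars';
    $loose_ok = eval q{ $loose_global = 1; 1 } ? 1 : 0;
}

{
    # This block inherits the file-level "use strict".
    $strict_ok    = eval q{ $strict_global = 1; 1 } ? 1 : 0;
    $strict_error = $@;
}

print "without strict 'vars': ", ($loose_ok  ? "compiled" : "failed"), "\n";
print "with strict 'vars':    ", ($strict_ok ? "compiled" : "failed"), "\n";
print "error: $strict_error";
```

Each eval compiles its string with the hints in effect at the point where the eval appears, so only the second one fails with a complaint about an undeclared global symbol.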
Other modules can insert subroutines and variables into the calling package's symbol table; the vars pragma does this for package global variable declarations.
When the BEGIN block ends successfully, the parser resumes at the point where it left off. If the environment of the parse has changed, subsequent parsing may behave differently. This is why this program gives an error at compile time (strict complains that the global symbol requires an explicit package name):
use Modern::Perl;
$undeclared_variable = 'Hello, world!';
say $undeclared_variable;
... but works with a minor change:
use Modern::Perl;
use vars '$undeclared_variable';
$undeclared_variable = 'Hello, world!';
say $undeclared_variable;
The equivalent code might be:
use Modern::Perl;
BEGIN
{
    require 'vars.pm';
    vars->import( '$undeclared_variable' );
}
$undeclared_variable = 'Hello, world!';
say $undeclared_variable;
BEGIN blocks don't have to do this; they can execute arbitrary code. use statements are the primary (if implicit) source of BEGIN blocks, however.
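As an illustration of arbitrary code changing the parse, here is a sketch in which a BEGIN block installs a subroutine by hand, much as an import() method might. The name shout is made up for the example.

```perl
use strict;
use warnings;

BEGIN {
    # Install a subroutine into the symbol table while compilation is paused.
    *shout = sub { return uc join ' ', @_ };
}

# The parser already knows about shout(), so it accepts the call
# without parentheses and treats the list as its arguments.
my $text = shout 'hello', 'world';

print "$text\n";
```

Without the BEGIN block (for example, with a plain runtime assignment to the glob), the parser would reach the call before any such subroutine existed and would reject the line.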
The Optree
A Perl 5 optree is a tree of C data structures all deriving from a structure called OP. Each op has a name, some flags, and zero or more children. Ops correspond to Perl 5 operations (called ppcodes) or Perl 5 data structures (scalars, arrays, hashes, et cetera).
You don't have to know any of this to write good Perl 5 code. You can inspect the optree produced by the parser with the B::Concise module:
$ perl -MO=Concise hello.pl
6  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 61 declorder.pl:5) v:%,*,&,{,$ ->3
5     <@> say vK ->6
3        <0> pushmark s ->4
4        <$> const[PV "Hello, world!"] s ->5
There's a lot of detail in this output, but you can ignore most of it; what matters is that it uses nesting to represent the tree structure. Thus the top item (leave) represents the root of the tree. The numbers in the leftmost column represent the execution order of the program. The first op executed is the enter op. The numbers in the rightmost columns identify where execution will proceed; enter leads directly to nextstate.
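The same distinction between tree order and execution order is visible from Perl itself through the core B module. This is only a sketch: ops_in_execution_order is a hypothetical helper that follows each op's next pointer, which is enough for straight-line code but ignores branches.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: start at the first op executed and follow the
# "next" pointers until they run out. Branching ops are not handled.
sub ops_in_execution_order {
    my ($code) = @_;
    my @names;
    my $op = B::svref_2object($code)->START;
    while ($$op) {
        push @names, $op->name;
        $op = $op->next;
    }
    return @names;
}

my $hello      = sub { print "Hello, world!\n" };
my $root_name  = B::svref_2object($hello)->ROOT->name;
my @exec_order = ops_in_execution_order($hello);

print "root of the tree: $root_name\n";
print "execution order:  @exec_order\n";
```

For a subroutine the root op is leavesub rather than leave, and it is the last op reached when following the execution order, not the first.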
As you may expect, when Perl 5 has finished parsing a BEGIN block, it begins the execution of the code in that block at the entry point and resumes parsing at the exit point.
On Bytecode
Language implementations such as Rakudo Perl 6 (or anything else built on Parrot, for that matter) also build up tree structures representing the program, but they don't execute the tree structure directly. They produce bytecode, which is a stream of instructions.
The effect is similar, but instead of serializing and restoring C data structures, the bytecode format has a design conducive to serialization and restoration. You can execute bytecode one set of instructions at a time rather than inflating the whole structure into memory. (I lie a little bit here; some bytecode strategies require building some data structures before execution, but you get the point.)
Perl 5 has an experimental compiler backend in the form of the B::* modules intended to give access to the compiler and optree from Perl programs themselves. There have been attempts to serialize the optree and restore it later, but it's never worked well. Perl 5's execution model makes this difficult.
Infrequently Asked Questions
Wait, so Perl 5 doesn't interpret every statement as it parses it?
Not at all, unless you're running a Perl 5 REPL or the debugger.
How does Perl 5 handle ambiguous syntactic constructs then, if it doesn't resolve them at runtime?
The example given in "Perl 5 is not Deterministically and Statically Parseable in All Cases" has unambiguous parses -- if you can execute code in BEGIN blocks. Don't get hung up on the word "statically". The Perl 5 parser has a wealth of information available.
If it really can't make sense of an ambiguous syntactic construct, it'll give a warning and try to continue or give a syntax error, depending on the severity.
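Here is a sketch of the kind of construct at issue, written in the spirit of that article's example rather than copied from it. The package names and the whatever subroutine are hypothetical. The same source string compiles two different ways, depending on what the parser already knows about whatever when it reaches the slash.

```perl
use strict;
use warnings;

my $source = q{whatever / 25 ; # / ; die "parsed as a regex"};
my ($division_result, $regex_error);

{
    package Demo::Nullary;
    # Takes no arguments, so the slash must be the division operator
    # and everything after the # is a comment.
    sub whatever() { 100 }
    $division_result = eval $source;
}

{
    package Demo::Unary;
    # Takes one argument, so the slash starts a regex match
    # and the die is a real statement.
    sub whatever($) { "argument: $_[0]" }
    local $_ = '';
    eval $source;
    $regex_error = $@;
}

print "nullary: $division_result\n";
print "unary:   $regex_error";
```

Neither parse is a guess; in each package the parser has exactly the information it needs by the time it reads the line.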
How much work would it take to make Perl 5 use bytecode instead of an optree?
Lots -- many developer-years of refactoring (and likely a few deprecation cycles to migrate Perl 5 extensions to a better-encapsulated API) might do it.
But Python 2.x always produces the same parse tree for cases where it doesn't know if a symbol is a function name or a variable!
That's because Python's bytecode doesn't distinguish between the two at compile time; it prefers to try an operation at run time and give an error there. (Before you write angry comments saying "Python is strongly typed and Perl 5 is weakly typed and your lawn is ugly, you big ninny!", let me give a few disclaimers. First, I don't know what Python 3.x does. I've only checked Python 2.6.2. Second, "strong typing" doesn't mean much of anything. Third, sigils give syntactic hints even to static parsers. Fourth, language designers and implementers prioritize different things. Perl tries to give really good error messages; improving error messages is a priority for Perl 6. That's not to say that Python doesn't care about error messages, but that distinguishing between container types at compile time gives Perl certain advantages here.)
When does the Modern Perl book come out?
I hope to publish it in November 2009. Please join the fun by reviewing it and making suggestions.