How a Perl 5 Program Works

In the discussions which prompted me to write On Parsing Perl 5, I've read many misconceptions about how Perl 5 works.

The strangest example is a comment on Lambda the Ultimate which contains an incorrect suggestion that Perl 5 subroutines take the source code of the program as an argument to resolve ambiguous parsing.

Someone elsewhere gave the example that Perl gurus preface answers to the question "Is Perl 5 interpreted or compiled?" with "It depends." (Part of the reason for that is that Larry himself often prefaces his answers to all sorts of questions with "It depends.")

Perl 5's execution model isn't quite the same as a traditional compiler (whatever that means) and it's definitely not the same as the traditional notion of an interpreter. There are two distinct phases of execution of a Perl 5 program: compile time and runtime. You must understand the difference to take full advantage of Perl 5.

Compile Time

The compilation stage of a Perl 5 program resembles the compilation stage as you may have learned it in a compilers class. A lexer analyses source code, producing individual tokens. A parser analyses patterns of tokens to build a tree structure representing the operations of the program and to produce syntax errors and warnings. An optimizer prunes and rebuilds the tree for efficiency and consistency.

Unlike the compilation model you may expect from a language implementation which produces a serialized compilation artifact (think of a C compiler producing a .o file, for example, or `javac` emitting a .class file), Perl 5 stores this data structure in memory only. That's one way in which Perl 5 differs from other language implementations; it manages the artifacts of compilation itself.

Certain operations happen only at compilation time -- looking up function names where possible, binding lexical (`my`) variables to lexical pads, entering global symbols into symbol tables. One common error which confuses novices is not realizing that `my` declarations have compile time effects while assignments have runtime effects.
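Here's a minimal sketch of that confusion. It uses a BEGIN block (described later in this article) to peek at a lexical variable while compilation is still in progress. The variable names are my own, chosen for illustration.

```perl
use strict;
use warnings;

our $defined_at_compile_time;

my $greeting = 'Hello, world!';

# This block runs as soon as the parser has finished reading it, which is
# before the assignment above has executed. The name $greeting is already
# declared (so strict does not complain), but it holds no value yet.
BEGIN { $defined_at_compile_time = defined $greeting ? 'yes' : 'no' }

print "Defined during compilation? $defined_at_compile_time\n";
print "Value at runtime: $greeting\n";
```

The declaration took effect when the parser saw it; the assignment waited for runtime.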

Runtime

After Perl 5 has produced its tree -- the optree -- it begins executing the program by traversing the optree in execution order. Even though the tree structure is a tree for ease of representing operations, execution does not start at the root of the tree and proceed leafward. At this point in the program, the source code is gone.
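If you want to see the difference between the root of the tree and the start of execution, the core B module exposes both. This is only a sketch; greet() is a hypothetical subroutine, and the op names it prints may vary a little between Perl versions.

```perl
use strict;
use warnings;
use B ();

sub greet { my $name = shift; return "Hello, $name!" }

# Get the B object which wraps the compiled subroutine
my $cv = B::svref_2object( \&greet );

my $root_name  = $cv->ROOT->name;     # the root of the tree
my $start_name = $cv->START->name;    # the first op to execute

# Follow the execution-order links until they run out
my @execution_order;
for ( my $op = $cv->START; $$op; $op = $op->next ) {
    push @execution_order, $op->name;
}

print "Root of the tree: $root_name\n";
print "First op to run:  $start_name\n";
print "Execution order:  @execution_order\n";
```

The root op appears last in execution order, not first.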

Of course, certain runtime operations such as the eval STRING operator or require can begin a new, limited compilation time -- but they have no effect on source code already parsed into the optree. This is important.
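A short sketch of eval STRING starting a new, limited compilation at runtime follows. The strings and variable names are illustrative only.

```perl
use strict;
use warnings;

# Compile new source code while the program is already running
my $doubler = eval 'sub { $_[0] * 2 }' or die $@;
print 'Doubled: ', $doubler->(21), "\n";

# A string which fails to compile reports its error in $@; the program
# which called eval has already compiled and keeps running.
my $broken = eval 'sub { $_[0] * }';
my $error  = $@;
print "Compilation of the second string failed: $error";
print "Still running.\n";
```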

Executing Code During Compile Time

One of the difficulties in parsing Perl 5 code statically is that one Perl 5 linguistic construct executes code during compile time. The BEGIN block executes as soon as Perl 5 has successfully parsed it. (See perldoc perlmod for more information.)
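You can watch this ordering for yourself. In this sketch (the @order array is just a name I picked), the BEGIN block appears in the middle of the file but runs before either of the ordinary statements around it:

```perl
use strict;
use warnings;

our @order;

push @order, 'first runtime statement';

# Runs as soon as the parser reaches the closing brace, long before
# the push statements above and below execute.
BEGIN { push @order, 'BEGIN block' }

push @order, 'second runtime statement';

print "$_\n" for @order;
```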

Because BEGIN temporarily suspends compilation, it can manipulate the environment used by the parser to affect how the parser will treat subsequent source code. An easy example is the case of importing symbols from an external module:

use strict;

The Perl 5 parser treats use statements as if you'd written something like:

BEGIN
{
    require 'strict.pm';
    strict->import();
}

As soon as the parser reaches the semicolon, it executes this code. This causes perl to try to load strict.pm, compile it, and then call its import() method. Within that method, the strict module modifies lexically scoped hints, some of which cause the parser to require declarations of variables and barewords.
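If you're curious, you can observe the hints changing. The special variable $^H (see perldoc perlvar) holds the compile-time hints, and BEGIN blocks can read it while the parser is paused. This sketch assumes the bit values used in strict.pm's source, which are an implementation detail and not something to rely on in real code.

```perl
use warnings;

our ( $hints_before, $hints_after );

BEGIN { $hints_before = $^H }    # hints before strict loads
use strict;
BEGIN { $hints_after = $^H }     # hints after strict->import() has run

# Assumption: 0x602 combines the refs, subs, and vars bits from strict.pm.
my $strict_bits = 0x602;

printf "strict bits before: %#x\n", $hints_before & $strict_bits;
printf "strict bits after:  %#x\n", $hints_after  & $strict_bits;
```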

Other modules can insert subroutines and variables into the calling package's symbol table; the vars pragma does this for package global variable declarations.
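Here's a rough sketch of how a module might do that. My::Greeter and greet() are hypothetical names, and I've put everything in one file, so I call import() from an explicit BEGIN block instead of writing a use statement.

```perl
use strict;
use warnings;

package My::Greeter;    # hypothetical module, inlined for illustration

sub import {
    my ($class) = @_;
    my $caller = caller;

    # Install a subroutine into the symbol table of the calling package
    no strict 'refs';
    *{"${caller}::greet"} = sub { "Hello, $_[0]!" };
}

package main;

# Roughly what "use My::Greeter;" would do if the module lived in its own file
BEGIN { My::Greeter->import() }

print greet('world'), "\n";
```

By the time the parser reaches the call to greet(), the subroutine already exists in main's symbol table.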

When the BEGIN block ends successfully, the parser resumes at the point where it left off. If the environment of the parse has changed, subsequent parsing may behave differently. This is why this program fails at compile time (the strict pragma, which Modern::Perl enables, rejects the undeclared variable):

use Modern::Perl;

$undeclared_variable = 'Hello, world!';
say $undeclared_variable;

... but works with a minor change:

use Modern::Perl;

use vars '$undeclared_variable';

$undeclared_variable = 'Hello, world!';
say $undeclared_variable;

The equivalent code might be:

use Modern::Perl;

BEGIN
{
    require 'vars.pm';
    vars->import( '$undeclared_variable' );
}

$undeclared_variable = 'Hello, world!';
say $undeclared_variable;

BEGIN blocks don't have to do this; they can execute arbitrary code. use statements are the primary (if implicit) source of BEGIN blocks, however.
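As one example of arbitrary code changing the parse, a BEGIN block can install subroutines whose prototypes tell the parser how to read later expressions. This is a sketch; no_args and any_args are names I made up for it.

```perl
use strict;
use warnings;

BEGIN {
    no warnings 'once';

    # An empty prototype tells the parser this function takes no arguments
    *no_args = sub () { 42 };

    # No prototype: the parser treats the function as a list operator
    *any_args = sub { scalar @_ };
}

my $sum   = no_args + 1;     # parsed as no_args() + 1
my $count = any_args + 1;    # parsed as any_args( +1 )

print "no_args + 1  gives $sum\n";
print "any_args + 1 gives $count\n";
```

The two expressions look identical on the page, but the parser builds different trees for them because of what the BEGIN block did.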

The Optree

A Perl 5 optree is a tree of C data structures all deriving from a structure called OP. Each op has a name, some flags, and zero or more children. Ops correspond to Perl 5 operations (called ppcodes) or Perl 5 data structures (scalars, arrays, hashes, et cetera).

You don't have to know any of this to write good Perl 5 code. You can inspect the optree produced by the parser with the B::Concise module:

$ perl -MO=Concise hello.pl
6  <g@> leave[1 ref] vKP/REFC ->(end)
1     <g0> enter ->2
2     <g;> nextstate(main 61 declorder.pl:5) v:%,*,&,{,$ ->3
5     <g@> say vK ->6
3        <g0> pushmark s ->4
4        <g$> const[PV "Hello, world!"] s ->5

There's a lot of detail in this output, but you can ignore most of it; what matters is that it uses nesting to represent the tree structure. Thus the top item (leave) represents the root of the tree. The numbers in the leftmost column represent the execution order of the program. The first op executed is the enter op. The numbers in the rightmost columns identify where execution will proceed; enter leads directly to nextstate.
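B::Concise can also list ops in execution order rather than tree order, with its -exec option. The sketch below uses the module's documented compile() and walk_output() functions from inside a program; greet() is a hypothetical subroutine, and the exact listing will differ between Perl versions.

```perl
use strict;
use warnings;
use B::Concise ();

sub greet { print "Hello, world!\n" }

# Build a renderer for greet() which lists ops in execution order
my $walker = B::Concise::compile( '-exec', \&greet );

# Send the listing to a string instead of STDOUT
my $listing = '';
B::Concise::walk_output( \$listing );

$walker->();
print $listing;
```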

As you may expect, when Perl 5 has finished parsing a BEGIN block, it begins the execution of the code in that block at the entry point and resumes parsing at the exit point.

On Bytecode

Language implementations such as Rakudo Perl 6 (or anything else built on Parrot, for that matter) also build up tree structures representing the program, but they don't execute the tree structure directly. They produce bytecode, which is a stream of instructions.

The effect is similar, but instead of serializing and restoring C data structures, the bytecode format has a design conducive to serialization and restoration. You can execute bytecode one set of instructions at a time rather than inflating the whole structure into memory. (I lie a little bit here; some bytecode strategies require building some data structures before execution, but you get the point.)

Perl 5 has an experimental compiler backend in the form of the B::* modules intended to give access to the compiler and optree from Perl programs themselves. There have been attempts to serialize the optree and restore it later, but they have never worked well. Perl 5's execution model makes this difficult.

Infrequently Asked Questions

Wait, so Perl 5 doesn't interpret every statement as it parses it?

Not at all, unless you're running a Perl 5 REPL or the debugger.

How does Perl 5 handle ambiguous syntactic constructs then, if it doesn't resolve them at runtime?

The example given in "Perl 5 is not Deterministically and Statically Parseable in All Cases" has unambiguous parses -- if you can execute code in BEGIN blocks. Don't get hung up on the word "statically". The Perl 5 parser has a wealth of information available.

If it really can't make sense of an ambiguous syntactic construct, it'll give a warning and try to continue or give a syntax error, depending on the severity.

How much work would it take to make Perl 5 use bytecode instead of an optree?

Lots -- many developer-years of refactoring (and likely a few deprecation cycles to migrate Perl 5 extensions to a better-encapsulated API) might do it.

But Python 2.x always produces the same parse tree for cases where it doesn't know if a symbol is a function name or a variable!

That's because Python's bytecode doesn't distinguish between the two at compile time; it prefers to try an operation at run time and give an error there. (Before you write angry comments saying "Python is strongly typed and Perl 5 is weakly typed and your lawn is ugly, you big ninny!", let me give a few disclaimers. First, I don't know what Python 3.x does. I've only checked Python 2.6.2. Second, "strong typing" doesn't mean much of anything. Third, sigils give syntactic hints even to static parsers. Fourth, language designers and implementers prioritize different things. Perl tries to give really good error messages; improving error messages is a priority for Perl 6. That's not to say that Python doesn't care about error messages, but that distinguishing between container types at compile time gives Perl certain advantages here.)

When does the Modern Perl book come out?

I hope to publish it in November 2009. Please join the fun by reviewing it and making suggestions.
