August 2009 Archives

Designing for the Unanticipated

By chromatic on August 31, 2009 12:24 PM | 2 Comments

In 1987, not even Larry Wall could have predicted the Web. Perl's first wave of adoption was system administration. Perl's second wave of adoption was as a language for CGI programming.

My goal for Perl 6 is a language for general-purpose programming.

One of the design goals for Perl 5 was to remove arbitrary distinctions. Perl 4 was embeddable, but extending it for Oracle or Sybase or access to other database systems meant recompiling the Perl 4 binary. This is why you still occasionally see references to oraperl or sybperl.

Perl 5's greatest benefit is the development of the CPAN. That's not necessarily something Larry could have predicted in 1993 -- and certainly not in 1987 -- but designing Perl 5 for extensibility made the CPAN possible.

Fifteen years after the release of Perl 5.000, the Perl 5 community is still discovering ways to improve CPAN and the Perl 5 extension process.

That's normal. That's expectable. As a language designer, the best you can hope for is to encourage this kind of evolution and revolution. Mostly you have to hope that none of your design decisions actively prevent it. You also have to accept that you won't get it all right the first time.

I've written before that rapid, feedback-aware iteration is an effective way to understand a problem and find working solutions. Feedback from real users with real uses is invaluable (see also why Pugs and, later, Rakudo multiplied the velocity of Perl 6). You won't understand the problem fully until you've had to balance multiple, sometimes contradictory, constraints. Just as no battle plan survives contact with the enemy, no software design survives contact with users unscathed. You may find that you're painting yourself into a corner.

There's a balance: you don't want to cause users unnecessary pain by rapid, arbitrary change, but legacy code and features can hamstring a project.

(Sebastian Riedel's Version Numbers and Backwards Compatibility demonstrates a conundrum: how do you get this feedback early in a project's life when the CPAN is such a compelling distribution mechanism? The CPAN ecosystem provides few tools to manage rapid yet backwards-incompatible releases.)

That balance is especially difficult in language design, where consistency of expression and abstraction and concept is important. 361 contradictory Perl 6 RFCs demonstrate that Perl 5 has real flaws, but a language revision which adopted those RFCs in whole would be a mess.

A vision for the present and future of a software project is important, but that vision must also seek uniformity (where possible) and emphasize differences (where possible). Similar things should look similar and different things should look different. You see this in debates over operator overloading: the C programming language's polymorphic addition operator manages to add integers and floats. You can overload it in C++ to add matrices and irrational numbers. Yet some people rightly complain when someone else overloads it to append to files across a network.

I don't believe anyone really knows how to design a programming language that's sufficiently internally consistent that it's easy to begin, possible to master with minimal pain, well-optimized for its current domain strengths, and ready to take over the Next Big Niche.

I don't know many (successful) general purpose languages which haven't switched their focus more than once. I think "success" for a programming language means that people use it for purposes the designers never anticipated.

That's why discussions about Perl 5 deprecations, for example, frustrate me. I'm sympathetic to the desire not to break existing code (though people always have the option not to upgrade), but I want to remove arbitrary restrictions and smooth off rough edges and make programming tomorrow a little more joyful than programming today.

In other words, bring on Perl 5.11 and 5.12. 5.10.1 isn't the end of the line. I'm not stopping there. Bring on Rakudo Star and Parrot 2.0 and all of the intervening releases. I don't care if the software I rely on isn't "done", whatever that word means in this context. I care if it's useful, usable, and if it continually gets a little bit better based on feedback from real world users.

Sure, there's change involved. Certainly change produces pain sometimes. Yet without change, there's no progress -- and I don't believe we've arrived at the perfect language in which to write perfect software yet.

Who Benefits from the CPAN?

By chromatic on August 28, 2009 2:22 PM | 2 Comments

First: This is neither a complaint nor a criticism. I understand the intent of the CPAN and its goals. I believe it meets those goals effectively.

If you talk to Jarkko about the CPAN, he'll likely tell you that it's primarily a distribution service. It's a series of regularly updated mirrors containing some metadata and an archive of redistributable code. Many proposals for enhancements and replacements and reinventions in other languages have come and gone. Most of them have tried to add complexity to this simple base. That's one reason they haven't succeeded.

This half of the CPAN makes code available to users.

Another half of the CPAN is PAUSE, the service which allows CPAN developers to upload their code to the metadata analysis and code distribution service.

The third half of the CPAN consists of the tools used to find and install CPAN distributions with partial or full automation. It's an optional part of the CPAN experience, but it demonstrates that the CPAN ecosystem also includes tools which rely on the metadata and mirroring services which the CPAN makes available. Without this metadata, the CPAN would be much less useful.

It's also the metadata which makes possible services such as search.cpan.org (which many people consider the face of the CPAN), RT for CPAN, CPAN Testers, CPANTS, CPAN Ratings, CPAN Forums, and plenty of other services now and in the future.

That's what the CPAN is: a loose federation of sites and services built around a code and metadata mirroring system, with an upload service for registered developers.

Who's It For?

I believe the primary beneficiaries of the CPAN are active CPAN developers.

By uploading your code to the CPAN, you get worldwide mirroring and distribution. You get test results from a wide variety of platforms and versions. You get bug tracking, documentation hosting, reviews, and feedback on the quality and efficacy of your distribution.

You get to push your installation and dependency management to CPAN installers. Because CPAN tools are effective at gathering dependency information and publishing it in a form that other CPAN tools can understand, the easiest way to install distributions from the CPAN is with a CPAN shell such as CPAN.pm or CPANPLUS. Utilities exist for free software distributions such as Debian and Gentoo to wrap CPAN distributions into OS packages where the packaging system can manage them, but they're necessarily specific to individual platforms, whereas the CPAN shells can run on any operating system where Perl 5 runs.

One strong benefit of the existing CPAN shells is that they run distribution test suites before installation by default, refusing to install when test failures occur. This provides strong pressure to review, report, and fix test failures; the focus is on quality by default.

Active CPAN developers know when and how to report bugs, how to read CPAN Testers reports, and how to force installations. They may know how to use the BackPAN or to use an earlier version of a dependency.

This brings up a subtler feature of the CPAN which optimizes the experience for active CPAN developers: you always get the newest version of a distribution. While a PAUSE/CPAN shell hack allows developers to upload a development version which people cannot install accidentally, there's little ability to specify in dependencies that you want users to install a specific version of a dependency. One accidental upload in any of a dozen distributions could render half of the distributions on the CPAN uninstallable.

In some ways, this feature creates and exacerbates a problem. It can be difficult to bundle a distribution and all of its dependencies as the dependency graph can change during the bundling process.

A CPAN for Normal Users

What would a CPAN look like for normal users? ActiveState's PPM isn't a bad model in some ways, though it hews too closely to the CPAN itself in others. Binary repositories for Linux distributions have other advantages. I can think of several attributes of a CPAN enhancement for non-developers:

Binary distributions, or at least not requiring the presence of a C or C++ compiler and make utilities. This could be optional.
Run the tests on installation for verification and reporting purposes. This could also be optional, but I like the quality-by-default approach.
Bundling a distribution and all of its dependencies into a single, installable package.
Automatic relocation (perhaps through the use of local::lib or something similar) to allow multiple versions of a single distribution installed and usable.
Regular, tested updates to bundles and the contained dependency graphs.
Working with upstream.
Integration with OS packages.

I have no good ideas how to accomplish the last two. Working with upstream can be difficult in the normal case; not everyone looks at CPAN Testers reports or the CPAN's RT or other CPAN extensions. Building OS packages seems like a lot of trouble and a lot of duplicate work.

Even so, the Perl 5 ecosystem already has most of the tools necessary to build such a thing. We can build a dependency graph for most CPAN distributions, and we can identify those without accurate graphs. We can calculate the likelihood of tests passing on various Perl 5 versions and platforms given that graph. It only takes a little bit of code to bundle most graphs into a dependency-first installable bundle, and a small loader module could set @INC paths appropriately.
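As a rough sketch of what "dependency-first" means here, the following walks a dependency graph depth-first and emits each distribution only after everything it requires. The distribution names and the install_order() helper are hypothetical, invented for illustration; a real tool would read the graph from CPAN metadata rather than a hard-coded hash.

```perl
use strict;
use warnings;

# Hypothetical dependency graph: each distribution maps to the
# distributions it requires. All names are made up for illustration.
my %requires = (
    'My-App'        => [ 'Web-Toolkit', 'Data-Store' ],
    'Web-Toolkit'   => [ 'Template-Lite' ],
    'Data-Store'    => [],
    'Template-Lite' => [],
);

# Return the requested distributions and their dependencies in
# dependency-first order, or die if the graph contains a cycle.
sub install_order {
    my ($graph, @targets) = @_;
    my (%state, @order);

    my $visit;
    $visit = sub {
        my ($dist) = @_;
        my $seen = $state{$dist} // '';
        return if $seen eq 'done';
        die "Circular dependency involving $dist\n" if $seen eq 'visiting';

        $state{$dist} = 'visiting';
        $visit->($_) for sort @{ $graph->{$dist} || [] };
        $state{$dist} = 'done';
        push @order, $dist;
    };

    $visit->($_) for @targets;
    undef $visit;    # break the closure's reference to itself
    return @order;
}

my @order = install_order( \%requires, 'My-App' );
print join( ' -> ', @order ), "\n";
```

Sorting the dependencies only makes the output deterministic; any order which respects the graph would do.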

Given a list of dependencies, it's possible to analyze the potential graphs for solutions and identify potential points of conflict or failure. If solutions exist, the software could create an installable bundle. Source code is the easiest, but a binary is possible.

It's also possible to keep these graphs and bundles up to date, with a lag of a few hours to a couple of days. Though calculating the possible solutions from a graph may be expensive, most of the information is cacheable.

Would people use such a system? I don't know. Should it replace elements of the current CPAN system? Never; it addresses a different purpose. Is it worth building? The idea continues to tickle my mind.

Vision and the Perl 5 Ecosystem

By chromatic on August 24, 2009 12:01 PM | 3 Comments

Now that Dave Mitchell has released Perl 5.10.1, the bikeshedding may begin again. David Golden asked What do you want Perl 5 to be? I've suggested that people define their own visions for the future of Perl 5.

I believe this is important for a few reasons:

The Perl 5 community has never agreed collectively in any sense on one or two strategic visions for the language and its ecosystem.
Discussions about practical concerns always flow from ideas -- often implicit -- about the vision of the language and its ecosystem.
Everyone who has a stake in the language and its ecosystem has particular uses in mind which deserve fair hearings.
A project without a strong sense of vision will make progress only on the whims of people dedicated enough to perform work and get it committed.

I admit some bitterness in the last point; I have little desire to participate in arguments on p5p over proposed changes. (I'm not thrilled about excoriation for a personal Git fork of Perl 5 away from p5p.)

Yet I still maintain that all of this debate and discussion and even some of the acrimony comes from conflicting ideas of what Perl 5 is and should be.

A Digression about Multiple Implementations

I watched with some interest a project to port Perl 5 to the JVM. Unfortunately, P5 on the JVM has stalled; the developers believe the problem is currently intractable.

I won't discuss the concerns the developers have raised; that's a discussion for another context.

Yet doesn't it seem strange that Perl 5 is one of the major languages without a port to the JVM or the CLR? How about to Parrot?

Part of the problem is a lack of people knowledgeable about Perl 5 internals with both the time and interest to work on such a project. Hopefully projects such as corehackers will improve things, especially now that the 5.10.1 release has stopped consuming all of the resources of p5p.

Another part of the problem is the internals themselves, but I've mentioned that before.

The final part of the problem is perception. For all that Perl has been a great (some might say promiscuous) glue language for a couple of decades, is it only my perception that the language ecosystem has grown insular? Perhaps that's part of the external perception problem: the Perl 5 world is its own world, with its CPAN and its XS and its quirks, and it leaves the rest of the world to its own devices.

Maybe that's part of the reason Perl 5 bindings are sometimes so difficult to coax out of companies and organizations. Maybe that's why Perl 5 support in new compilers and environments and services is so slow to appear, if it ever appears.

The Vision and the Ecosystem

I wonder if we're falling afoul of a perfectionist/completionist culture, where the next stable Perl 5 release will surely be the point at which we can stop this madness of patches and changes and releases.

Yes, that's a deliberate strawman.

Tell me it's wrong, though, especially in the face of distributions which don't expect people to write modern Perl 5 with the Perl 5 installed by default. (Don't take this as a personal complaint about Red Hat; Apple has had a similar flaw for much, much longer.)

Is it possible that some people see Perl 5 as an appliance, as plumbing? It's the strange contraption under your sink that sometimes makes a knocking noise if you leave the hot water on too long and you certainly hope it doesn't break because you know it'll make a fantastic mess and you might have trouble finding someone who can fix it.

I'm not suggesting that that's true. It might not be accurate.

Yet I wonder if the disconnect between the people who want to upgrade Perl 5 on their boxes once every ten years on a Saturday afternoon between 3:50 and 4:05 pm and the people like me who want to make Perl 5 easier to learn, easier to use correctly, and more powerful in terms of abstraction and portability and optimization is pervasive throughout the community.

Answer me these questions

Is a purpose of CPAN for experiments in language extensions?
Do they converge into proposals for additions to Perl 5 itself? If so, how?
Does the CPAN help or hinder the desire for stability of the once-a-decade upgraders?
Whose needs does the CPAN meet?
Are those needs complementary or contradictory to the goals of various groups?
How does the rest of the CPAN ecosystem support or discourage these goals?

I have my own answers to those questions, but the conversation is more important than me dictating what I think you should think.

However, if you thought Perl 5 version numbers were a mess, you haven't seen the real horror. Consider ratings of the Module::Build distribution. It's taken nine years to reach the point where some people are satisfied that a pure-Perl 5 installation system can replace a broken system which relies on your ability to use regular expressions to rewrite cross-platform shell scripts (which often need to call back into Perl 5 to emulate missing POSIX utilities). Other people still ask whether everyone should instead use a DSL wrapper around MakeMaker which relies on distribution authors to copy and paste code which may have bugs (and in fact has had a couple, requiring those authors to release new versions of those distributions merely to make them installable).

If the Perl 5 community gets its once-a-decade chance to upgrade a bunch of machines next year, wouldn't it be nice to sneak in one or two Better is Better moments for a change?

The Problems with Indirect Object Notation

By chromatic on August 21, 2009 5:21 PM | 5 Comments

This excerpt from Modern Perl: the book discusses another feature of Perl 5 which makes parsing Perl 5 difficult. Avoiding this feature in your own code will make it more reliable and easier to debug.

Read a few Perl 5 object tutorials (or the documentation of too many CPAN modules), and you might believe that new is a language keyword just as in C++ and Java:

    my $q = new CGI; # DO NOT USE

As the discussion of objects has made clear, a constructor in Perl 5 is anything which returns an object. By convention, constructors are class methods named new(), but you have the flexibility to choose a different approach to meet your needs. If new() is a class method and not a keyword, the standard method call approach should apply:

    my $q = CGI->new();

These syntaxes are equivalent in behavior, except when they're not.

The first form is the indirect object form (more precisely, the dative case), where the verb (the method) precedes the noun to which it refers (the object). This is fine in spoken languages, but it introduces difficult-to-debug ambiguities in Perl 5.

Bareword indirect invocations

One problem is that the name of the method is a bareword, requiring the Perl 5 parser to perform several heuristics to determine the proper interpretation. While these heuristics are well-tested and almost always correct, their failure modes can be very confusing and difficult to debug. Worse, they're fragile in the face of the order of compilation and module loading.

Parsing is more difficult for humans and the computer when the constructor takes arguments. The Java-style approach may resemble:

    # DO NOT USE
    my $obj = new Class( arg => $value );

... thus making the classname Class look like a subroutine call. Perl 5 can disambiguate many of these cases, but its heuristics depend on which package names the parser has seen so far, which barewords it has already resolved (and how it resolved them), and the names of subroutines already declared in the current package.

Imagine running afoul of a subroutine with prototypes with a name which just happens to conflict somehow with the name of a class or a method called indirectly. This happens infrequently, but it's difficult enough to debug that it's worth making impossible by avoiding this syntax.

Indirect notation scalar limitations

Another danger of the syntax is that the parser expects a single scalar expression as the object. You may have had trouble printing to a filehandle stored in an aggregate variable:

    # DOES NOT WORK AS WRITTEN
    say $config->{output} "This is a diagnostic message!";

print, close, and say -- all keywords which operate on filehandles -- operate in an indirect fashion. This was fine when filehandles were package globals, but with lexical filehandles the problem can be more apparent, when Perl 5 tries to call the say method on the $config object. The solution is to disambiguate the expression which produces the intended invocant:

    say {$config->{output}} "This is a diagnostic message!";

Alternatives to indirect notation

Direct invocation notation does not suffer this ambiguity problem. To construct an object, call the constructor method on the class name directly:

    my $q   = CGI->new();
    my $obj = Class->new( arg => $value );

For filehandle operations, which are limited, known to the Perl 5 parser directly, and pervasive in their idiomatic use of the dative case, use curly brackets to remove ambiguity about your intended invocant. Alternately, consider loading the core IO::Handle module which allows you to perform IO operations by calling methods on filehandle objects (such as lexical filehandles).
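Here's a small sketch of both alternatives side by side. It writes to an in-memory filehandle so that it has no side effects; the $config hash and its output key mirror the earlier example, and everything else is illustrative only.

```perl
use strict;
use warnings;
use feature 'say';
use IO::Handle;

# Write to an in-memory filehandle so the example has no side effects.
my $buffer = '';
open my $fh, '>', \$buffer
    or die "Cannot open in-memory filehandle: $!";

my $config = { output => $fh };

# Curly brackets mark the whole expression as the filehandle.
say { $config->{output} } 'block form';

# IO::Handle lets you call print() as an ordinary method instead.
$config->{output}->print("method form\n");

close $fh or die "Cannot close in-memory filehandle: $!";
print $buffer;
```

The method form has no indirect object for the parser to guess at, at the cost of a little more punctuation.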

To identify indirect calls in your code, use the CPAN module Perl::Critic::Policy::Dynamic::NoIndirect (a plugin for Perl::Critic). To forbid their use at compile time, use the CPAN module indirect.

The Problem with Prototypes

By chromatic on August 20, 2009 1:13 AM

In How a Perl 5 Program Works and On Parsing Perl 5, I mentioned ways to manipulate the Perl 5 parser as it executes. The easiest way to do so is through the use of Perl 5 subroutine prototypes. This early draft excerpt from Modern Perl: the book explains the good and the bad of prototypes and when they're a good idea in modern Perl code.

Perl 5's prototypes serve two purposes. First, they're hints to the parser to change the way it parses subroutines and their arguments. Second, they change the way Perl 5 handles arguments to those subroutines when it executes them. A common novice mistake is to assume that they serve the same language purpose as subroutine signatures in other languages. This is not true.

To declare a subroutine prototype, add it after the name:

    sub foo        (&@);
    sub bar        ($$) { ... }
    my  $baz = sub (&&) { ... };

You may add prototypes to subroutine forward declarations. You may also omit them from forward declarations. If you use a forward declaration with a prototype, that prototype must be present in the full subroutine declaration; Perl will give a prototype mismatch warning if not. The converse is not true: you may omit the prototype from a forward declaration and include it for the full declaration.

The original intent of prototypes was to allow users to define their own subroutines which behaved like (certain) built-in operators. For example, consider the behavior of the push operator, which takes an array and a list. While Perl 5 would normally flatten the array and list into a single list at the call site, the Perl 5 parser knows that a call to push must effectively pass the array as a single unit so that push can operate on the array in place.

The prototype operator takes the name of a function and returns a string representing its prototype, if any, and undef otherwise. To see the prototype of a built-in operator, use the CORE:: form:

    $ perl -E "say prototype 'CORE::push';"
    \@@

As you might expect, the @ character represents a list. The backslash forces the corresponding argument to become a reference to that argument. Thus mypush might be:

    sub mypush (\@@)
    {
        my ($array, @rest) = @_;
        push @$array, @rest;
    }
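A quick usage sketch may make the reference-taking behavior easier to see. It repeats the mypush definition so that it can run on its own; the @items array is only illustrative. The declaration has to appear before the call so that the parser knows the prototype when it compiles that call.

```perl
use strict;
use warnings;

# Same shape as mypush above, declared before its first use.
sub mypush (\@@) {
    my ($array, @rest) = @_;
    push @$array, @rest;
}

my @items = ( 1, 2 );

# The prototype makes Perl pass \@items rather than flattening it.
mypush @items, 3, 4;

print "@items\n";

# prototype() also works on user-defined subroutines.
print prototype( \&mypush ), "\n";
```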

Valid prototype characters include $ to force a scalar argument, % to mark a hash (most often used as a reference), and & which marks a code block. The fullest documentation is available in perldoc perlsub.

The Problem with Prototypes

The main problem with prototypes is that they behave differently than most people expect when first encountering them. Prototypes can change the parsing of subsequent code and they can coerce the types of arguments. They don't serve as documentation to the number or types of arguments subroutines expect, nor do they map arguments to named parameters.

Prototype coercions work in subtle ways, such as enforcing scalar context on incoming arguments:

    sub numeric_equality($$)
    {
        my ($left, $right) = @_;
        return $left == $right;
    }

    my @nums = 1 .. 10;

    say "They're equal, whatever that means!" if numeric_equality @nums, 10;

... and not working on anything more complex than simple expressions:

    sub mypush(\@@);

    # XXX: prototype type mismatch syntax error
    mypush( my $elems = [], 1 .. 20 );
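To make the scalar context coercion in numeric_equality concrete, this sketch counts the arguments two otherwise identical subroutines receive. Both helper names are hypothetical. With the ($$) prototype, @nums evaluates in scalar context (giving its element count); without a prototype, it flattens into the argument list.

```perl
use strict;
use warnings;

# Hypothetical helpers which report how many arguments they received.
sub count_with_prototype ($$) { return scalar @_ }
sub count_without_prototype   { return scalar @_ }

my @nums = 1 .. 10;

# @nums is coerced to scalar context, so only two arguments arrive.
my $with = count_with_prototype( @nums, 10 );

# @nums flattens, so ten elements plus one more argument arrive.
my $without = count_without_prototype( @nums, 10 );

print "with prototype:    $with\n";
print "without prototype: $without\n";
```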

Those aren't even the subtler kinds of confusion you can get from prototypes; see Far More Than Everything You've Ever Wanted to Know About Prototypes in Perl for a dated but enlightening explanation of other problems.

Good Uses of Prototypes

As long as code maintainers do not confuse them for full subroutine signatures, prototypes have a few valid uses.

First, they are often necessary to emulate and override built-in operators with user-defined subroutines. As shown earlier, you must first check that you can override the built-in operator by checking that prototype does not return undef. Once you know the prototype of the operator, use the subs pragma to declare that you want to override a core operator:

    use subs 'push';

    sub push (\@@) { ... }

Beware that the subs pragma is in effect for the remainder of the file, regardless of any lexical scoping.

The second reason to use prototypes is to define compile-time constants. A subroutine declared with an empty prototype (as opposed to an absent prototype) which evaluates to a single expression becomes a constant in the Perl 5 optree rather than a subroutine call:

    sub PI () { 4 * atan2(1, 1) }

The Perl 5 parser knows to substitute the calculated value of pi whenever it encounters a bareword or parenthesized call to PI in the rest of the source code (with respect to scoping and visibility).

Rather than defining constants directly, the core constant pragma handles the details for you and may be clearer to read. If you want to interpolate constants into strings, the Readonly module from the CPAN may be more useful.
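For comparison, a minimal sketch of the same constant declared with the core constant pragma follows. The $radius and $area variables are only there to show the constant in use.

```perl
use strict;
use warnings;

# The constant pragma creates the empty-prototype subroutine for you.
use constant PI => 4 * atan2( 1, 1 );

my $radius = 2;
my $area   = PI * $radius ** 2;

printf "Area of a circle with radius %d: %.4f\n", $radius, $area;

# Constants do not interpolate directly; this idiom is one workaround.
print "PI is a constant: @{[ sprintf '%.5f', PI ]}\n";
```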

The final reason to use a prototype is to extend Perl's syntax to operate on anonymous functions as blocks. The CPAN module Test::Exception uses this to good effect to provide a nice API with delayed computation. This sounds complex, but it's easy to explain. The throws_ok() subroutine takes three arguments: a block of code to run, a regular expression to match against the string of the exception, and an optional description of the test. Suppose that you want to test Perl 5's exception message when attempting to invoke a method on an undefined value:

    use Test::More tests => 1;
    use Test::Exception;

    throws_ok
        { my $not_an_object; $not_an_object->some_method() }
        qr/Can't call method "some_method" on an undefined value/,
        'Calling a method on an undefined invocant should throw exception';

The exported throws_ok() subroutine has a prototype of &$;$. Its first argument is a block, which Perl upgrades to a full-fledged anonymous function. The second argument is a scalar. The third argument is optional.

The most careful readers may have spotted a syntax oddity notable in its absence: there is no trailing comma after the end of the anonymous function passed as the first argument to throws_ok(). This is a quirk of the Perl 5 parser. Adding the comma causes a syntax error. The parser expects whitespace, not the comma operator.

You can use this API without the prototype. It's slightly less attractive:

    use Test::More tests => 1;
    use Test::Exception;

    throws_ok(
        sub { my $not_an_object; $not_an_object->some_method() },
        qr/Can't call method "some_method" on an undefined value/,
        'Calling a method on an undefined invocant should throw exception');

A sparing use of subroutine prototypes to remove the need for the sub keyword is reasonable. Few other uses of prototypes are compelling enough to overcome their drawbacks.
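If you want to try the block-taking style in your own code, here's a small sketch. The try_or() helper is hypothetical and not part of Test::Exception or any other module; it only shows how the & prototype lets callers omit the sub keyword and the comma after the block.

```perl
use strict;
use warnings;

# Hypothetical helper: run a block and return its value, or return a
# fallback if the block throws an exception or returns undef.
sub try_or (&$) {
    my ($code, $default) = @_;
    my $result = eval { $code->() };
    return defined $result ? $result : $default;
}

# No sub keyword and no comma after the block, thanks to the prototype.
my $ok     = try_or { 6 * 7 } 'fallback';
my $failed = try_or { die "boom\n" } 'fallback';

print "$ok\n";
print "$failed\n";
```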

How a Perl 5 Program Works

By chromatic on August 17, 2009 10:28 PM | 1 Comment

In the discussions which prompted me to write On Parsing Perl 5, I've read many misconceptions of how Perl 5 works.

The strangest example is a comment on Lambda the Ultimate which contains an incorrect suggestion that Perl 5 subroutines take the source code of the program as an argument to resolve ambiguous parsing.

Someone elsewhere gave the example that Perl gurus preface answers to the question "Is Perl 5 interpreted or compiled?" with "It depends." (Part of the reason for that is that Larry himself often prefaces his answers to all sorts of questions with "It depends.")

Perl 5's execution model isn't quite the same as a traditional compiler (whatever that means) and it's definitely not the same as the traditional notion of an interpreter. There are two distinct phases of execution of a Perl 5 program: compile time and runtime. You must understand the difference to take full advantage of Perl 5.

Compile Time

The compilation stage of a Perl 5 program resembles the compilation stage as you may have learned it in a compilers class. A lexer analyzes source code, producing individual tokens. A parser analyzes patterns of tokens to build a tree structure representing the operations of the program and to produce syntax errors and warnings. An optimizer prunes and rebuilds the tree for efficiency and consistency.

Unlike the compilation model you may expect from a language implementation which produces a serialized compilation artifact (think of a C compiler producing a .o file, for example, or `javac` emitting a .class file), Perl 5 stores this data structure in memory only. That's one way in which Perl 5 differs from other language implementations; it manages the artifacts of compilation itself.

Certain operations happen only at compilation time -- looking up function names where possible, binding lexical (`my`) variables to lexical pads, entering global symbols into symbol tables. One common error which confuses novices is not realizing that `my` declarations have compile time effects while assignments have runtime effects.
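A short sketch can show the difference between the declaration and the assignment. The variable names are illustrative. The BEGIN block runs while Perl is still compiling, so the parser already knows about $name (strict does not complain), but the assignment has not yet run.

```perl
use strict;
use warnings;

our @log;
my $name = 'World';

BEGIN {
    # The declaration has taken effect; the assignment has not.
    push @log,
        defined $name ? 'compile time: defined' : 'compile time: undefined';
}

# By now the assignment above has executed.
push @log, defined $name ? 'run time: defined' : 'run time: undefined';

print "$_\n" for @log;
```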

Runtime

After Perl 5 has produced its tree -- the optree -- it begins executing the program by traversing the optree in execution order. Even though the tree structure is a tree for ease of representing operations, execution does not start at the root of the tree and proceed leafward. At this point in the program, the source code is gone.

Of course, certain runtime operations such as the eval STRING operator or require can begin a new, limited compilation time -- but they have no effect on source code already parsed into the optree. This is important.
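As a minimal sketch of that limited compilation, this compiles a string of source code into an anonymous subroutine while the program is already running. The $source string and the $double name are illustrative only.

```perl
use strict;
use warnings;

# Source code which exists only as a string until runtime.
my $source = 'sub { return $_[0] * 2 }';

# eval STRING runs a new, limited compile phase for just this string.
my $double = eval $source;
die "Compilation failed: $@" unless $double;

print $double->(21), "\n";
```

Nothing about the surrounding program's optree changes; the new code gets its own.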

Executing Code During Compile Time

One of the difficulties in parsing Perl 5 code statically is that one Perl 5 linguistic construct executes code during compile time. The BEGIN block executes as soon as Perl 5 has successfully parsed it. (See perldoc perlmod for more information.)

Because BEGIN temporarily suspends compilation, it can manipulate the environment used by the parser to affect how the parser will treat subsequent source code. An easy example is the case of importing symbols from an external module:

    use strict;

The Perl 5 parser treats use statements as if you'd written something like:

    BEGIN
    {
        require 'strict.pm';
        strict->import();
    }

As soon as the parser reaches the semicolon, it executes this code. This causes perl to try to load strict.pm, compile it, and then call its import() method. Within that method, the strict module modifies lexically scoped hints, some of which cause the parser to require declarations of variables and barewords.

Other modules can insert subroutines and variables into the calling package's symbol table; the vars pragma does this for package global variable declarations.

When the BEGIN block ends successfully, the parser resumes at the point where it left off. If the environment of the parse has changed, subsequent parsing may behave differently. This is why this program gives an error at compile time (strict is in effect, and nothing has declared the variable):

    use Modern::Perl;

    $undeclared_variable = 'Hello, world!';
    say $undeclared_variable;

... but works with a minor change:

    use Modern::Perl;

    use vars '$undeclared_variable';

    $undeclared_variable = 'Hello, world!';
    say $undeclared_variable;

The equivalent code might be:

    use Modern::Perl;

    BEGIN
    {
        require 'vars.pm';
        vars->import( '$undeclared_variable' );
    }

    $undeclared_variable = 'Hello, world!';
    say $undeclared_variable;

BEGIN blocks don't have to do this; they can execute arbitrary code. use statements are the primary (if implicit) source of BEGIN blocks, however.
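Here's a sketch of a hand-written BEGIN block which changes how the parser treats later code, much as an import() method might. The shout() subroutine is hypothetical. Because the subroutine exists in the symbol table by the time the parser reaches the call, the parser accepts the call without parentheses.

```perl
use strict;
use warnings;

BEGIN {
    # Install a subroutine into the current package's symbol table
    # while compilation is still in progress.
    no warnings 'once';
    *shout = sub { return uc $_[0] };
}

# The parser already knows about shout(), so parentheses are optional.
my $loud = shout 'hello, world';

print "$loud\n";
```

Without the BEGIN block (assigning to the glob at runtime instead), the same call would fail to compile, because the parser would not yet know that shout names a subroutine.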

The Optree

A Perl 5 optree is a tree of C data structures all deriving from a structure called OP. Each op has a name, some flags, and zero or more children. Ops correspond to Perl 5 operations (called ppcodes) or Perl 5 data structures (scalars, arrays, hashes, et cetera).

You don't have to know any of this to write good Perl 5 code. You can inspect the optree produced by the parser with the B::Concise module:

$ perl -MO=Concise hello.pl
6  <g@> leave[1 ref] vKP/REFC ->(end)
1     <g0> enter ->2
2     <g;> nextstate(main 61 declorder.pl:5) v:%,*,&,{,$ ->3
5     <g@> say vK ->6
3        <g0> pushmark s ->4
4        <g$> const[PV "Hello, world!"] s ->5

There's a lot of detail in this output, but you can ignore most of it; what matters is that it uses nesting to represent the tree structure. Thus the top item (leave) represents the root of the tree. The numbers in the leftmost column represent the execution order of the program. The first op executed is the enter op. The numbers in the rightmost columns identify where execution will proceed; enter leads directly to nextstate.

As you may expect, when Perl 5 has finished parsing a BEGIN block, it begins the execution of the code in that block at the entry point and resumes parsing at the exit point.

On Bytecode

Language implementations such as Rakudo Perl 6 (or anything else built on Parrot, for that matter) also build up tree structures representing the program, but they don't execute the tree structure directly. They produce bytecode, which is a stream of instructions.

The effect is similar, but instead of serializing and restoring C data structures, the bytecode format has a design conducive to serialization and restoration. You can execute bytecode one set of instructions at a time rather than inflating the whole structure into memory. (I lie a little bit here; some bytecode strategies require building some data structures before execution, but you get the point.)
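If the distinction between walking a tree and running a stream of instructions seems abstract, this toy sketch may help. It is not how Perl 5 or Parrot work internally; the tree layout, the two instruction names, and every function name here are assumptions made up for illustration.

```perl
use strict;
use warnings;

# A toy expression tree for 1 + 2 * 3: [ operator, left, right ] or a number.
my $tree = [ '+', 1, [ '*', 2, 3 ] ];

my %apply = (
    '+' => sub { $_[0] + $_[1] },
    '*' => sub { $_[0] * $_[1] },
);

# Tree-walking evaluation: recurse through the nested structure.
sub walk
{
    my ($node) = @_;
    return $node unless ref $node;

    my ( $op, $left, $right ) = @$node;
    return $apply{$op}->( walk($left), walk($right) );
}

# Compilation: flatten the tree into a linear list of instructions.
sub compile
{
    my ($node) = @_;
    return ( [ 'push', $node ] ) unless ref $node;

    my ( $op, $left, $right ) = @$node;
    return ( compile($left), compile($right), [ 'apply', $op ] );
}

# Execution: run the instructions in order against a stack.
sub run
{
    my @stack;

    for my $instruction (@_)
    {
        my ( $name, $arg ) = @$instruction;

        if ( $name eq 'push' )
        {
            push @stack, $arg;
        }
        else
        {
            my $right = pop @stack;
            my $left  = pop @stack;
            push @stack, $apply{$arg}->( $left, $right );
        }
    }

    return pop @stack;
}

my @bytecode = compile($tree);
my $walked   = walk($tree);
my $executed = run(@bytecode);

print "tree walk: $walked, instruction stream: $executed\n";
```

Both approaches produce 7. The flat list is trivial to write to disk and read back one instruction at a time, which is the property the nested structure lacks.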

Perl 5 has an experimental compiler backend in the form of the B::* modules intended to give access to the compiler and optree from Perl programs themselves. There have been attempts to serialize the optree and restore it later, but it's never worked well. Perl 5's execution model makes this difficult.

Infrequently Asked Questions

Wait, so Perl 5 doesn't interpret every statement as it parses it?

Not at all, unless you're running a Perl 5 REPL or the debugger.

How does Perl 5 handle ambiguous syntactic constructs then, if it doesn't resolve them at runtime?

The example given in "Perl 5 is not Deterministically and Statically Parseable in All Cases" has unambiguous parses -- if you can execute code in BEGIN blocks. Don't get hung up on the word "statically". The Perl 5 parser has a wealth of information available.

If it really can't make sense of an ambiguous syntactic construct, it'll give a warning and try to continue or give a syntax error, depending on the severity.

How much work would it take to make Perl 5 use bytecode instead of an optree?

Lots -- many developer-years of refactoring (and likely a few deprecation cycles to migrate Perl 5 extensions to a better-encapsulated API) might do it.

But Python 2.x always produces the same parse tree for cases where it doesn't know if a symbol is a function name or a variable!

That's because Python's bytecode doesn't distinguish between the two at compile time; it prefers to try an operation at run time and give an error there. (Before you write angry comments saying "Python is strongly typed and Perl 5 is weakly typed and your lawn is ugly, you big ninny!", let me give a few disclaimers. First, I don't know what Python 3.x does. I've only checked Python 2.6.2. Second, "strong typing" doesn't mean much of anything. Third, sigils give syntactic hints even to static parsers. Fourth, language designers and implementers prioritize different things. Perl tries to give really good error messages; improving error messages is a priority for Perl 6. That's not to say that Python doesn't care about error messages, but that distinguishing between container types at compile time gives Perl certain advantages here.)

When does the Modern Perl book come out?

I hope to publish it in November 2009. Please join the fun by reviewing it and making suggestions.

On Parsing Perl 5

By chromatic on August 14, 2009 12:52 PM | 5 Comments

For some reason, a PerlMonks post from last year, Perl Cannot Be Parsed, has resurfaced in online discussions.

As you might expect, most comments are ill informed and blatant in their wrongness.

At least half of the comments are the equivalent of "OH noes i cantn't read PERL LOLLERS now i is nos Y!!!", but that's what you get for reading Twitter, Digg, YouTube, and occasionally Reddit. (If no one's invented the phrase "The Eternal 4channing of the Internet", I claim it.)

Another significant portion of the comments come from people who've read only the headline and feel sufficiently enlightened that they can proclaim their wisdom to the world at large, despite comments and clarifications on that PerlMonks post from people who've actually worked on the Perl 5 internals and understand the implications of that statement and their limitations.

A few comments are from people who know what this means and what it implies and those (few) cases where it applies.

I suspect the rest of the programming world which might care persists in a limited state of confusion. What does this mean? Why does it matter? What happens now? This is especially true for people who use Perl but don't necessarily consider themselves programmers. They don't have computer science backgrounds, but they'd use good tools like Perl::Critic if they knew they were available.

On Static Parsing

The unparsability of Perl 5 is a static unparsability. "Static" is a specific term of art which means, basically, "You can't run any part of the program to determine what it means."

One of the most common points of confusion in the discussion is a misunderstanding of what "static" means. This doesn't mean that it's impossible to parse Perl 5. The files perly.y and toke.c in the Perl 5 source tree do just that. However, the existence of subroutine prototypes and bareword declarations combined with the possibility of running arbitrary code during compilation with BEGIN blocks introduces ambiguity to the process.

That's all the proof addresses.

I realize that's not obvious to everyone reading this so far. That's fine. Just remember that the proof addresses important but limited circumstances.

Why limited? It is possible to parse Perl 5 code statically in many cases -- most cases, even. PPI is an example of this. PPI can build a document object model representing Perl 5 programs sufficient to perform static analysis; this is why Perl::Critic works.

PPI isn't perfect, and it can't resolve all possible Perl 5 programs to unambiguous object models. Part of that is the malleability of Perl 5 syntax.

Changing the Parse

I mentioned subroutine prototypes and barewords; these are two cases where predeclaration changes how Perl 5 will parse existing code.

Subroutine prototypes can change the expected arity and parse and runtime behavior of subroutines. In general, you should avoid them unless you know exactly what this means and what it implies.

Changing Expectations

In specific circumstances, prototypes are useful to change how Perl 5 passes arguments to functions. My favorite prototype lets you pass blocks as anonymous subs:

sub wrap (&) { ... }

my $wrapped = wrap { do_this(); do_that() };

Without the & prototype character, you'd have to call wrap() with more verbosity:

my $wrapped = wrap( sub { ... } );

It's a small prettification, but it's occasionally useful.

There are two important points to note about this example. First, Perl 5 must encounter the sub declaration with its prototype prior to the point at which it parses the call to wrap. The prototyped declaration changes how Perl 5 parses that construct -- otherwise the curly braces would run through the "is it a hash reference or a block?" heuristic and would not get promoted to an anonymous subroutine declaration.

Second, a smart and stateful static parser could identify this type of declaration and parse the code the same way the Perl 5 parser does. This is not by itself sufficient to make parsing Perl 5 statically indeterminate.

(Some people might object that statefulness in a static parser violates some rule of computer theory somewhere, which is a true point but practically useless; you need at least some state in your parser to find the end of a double-quoted string if you allow escaped embedded double quotes. The question is not if you allow lookahead but how much lookahead you need. Also, parser people, remember that I reserve the right to tell small fibs to explain things to an audience that won't ever take a class about automata.)
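Here's a runnable version of the wrap example, as a sketch. The body of wrap is my own invention (the original elides it); it merely brackets whatever the block returns, so that you can see both call styles produce an equivalent anonymous sub.

```perl
use strict;
use warnings;

# The & prototype must be visible before the parser reaches any call to wrap.
sub wrap (&)
{
    my ($code) = @_;

    # Hypothetical behavior: return a new sub which brackets the block's result.
    return sub { return '[' . $code->(@_) . ']' };
}

# The bare block becomes an anonymous sub because of the prototype.
my $wrapped = wrap { return uc shift };

# The explicit form works with or without the prototype.
my $verbose = wrap( sub { return uc shift } );

my $from_block = $wrapped->('block');
my $from_sub   = $verbose->('sub');

print "$from_block $from_sub\n";    # [BLOCK] [SUB]
```

If you move the declaration of wrap below the first call, the bare block form no longer compiles as a call to wrap; the explicit sub { ... } form with parentheses still does.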

Changing Arity

Arity can be more interesting; this is a fancy programming term which means "how many arguments a function or operator takes". Perl 5 subroutines are usually greedy; they'll happily slurp up all arguments in a big list. Certain subroutine prototypes allow you to change this. For example, you can define a foo subroutine which only takes two arguments:

sub foo ($$)
{
    return "<@_>";
}

say foo 1, 2;

The prototype changes the parsing of foo calls so that providing more than two arguments is a syntax error:

say 'Parsing done';
say foo 1, 2, 3;

$ perl protos.pl
Too many arguments for main::foo at protos.pl line 13, near "3;"
Execution of protos.pl aborted due to compilation errors.

Again, Perl 5 must have encountered the prototyped declaration before it can apply the parsing rule changes to any code. Again, a clever and stateful static parser could detect these parser changes and adapt to them.
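You can watch this happen without aborting your program by compiling the calls at run time with string eval. This is only a sketch; the variable names are mine, and the exact wording after the first few words of the error may vary between perl versions.

```perl
use strict;
use warnings;

sub foo ($$)
{
    return "<@_>";
}

# The prototype builtin reports what the parser has recorded for foo.
my $recorded = prototype 'main::foo';

# A string eval compiles its code at run time, so the compilation error
# lands in $@ instead of aborting the whole program.
my $two_args   = eval 'foo 1, 2';
my $three_args = eval 'foo 1, 2, 3';
my $error      = $@;

print "prototype: $recorded\n";
print "two arguments: $two_args\n";
print "three arguments: $error";
```

The second eval never runs foo at all; it fails while parsing, which is why $three_args stays undefined and $error holds the "Too many arguments" message.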

Barewords

The Perl 5 parser has a complex heuristic for figuring out what a bareword actually means. It could be a filehandle, a class name, a function, or more. Several good Perl 5 programmers recommend against the indirect object invocation syntax (though dative is a description with more technical accuracy for linguists) because it can parse differently depending on code Perl 5 has already executed.

my $obj = new Class foo => 'bar'; # do not mimic this example

Perl 5 has heuristics to determine what new and Class mean in this context. It may be obvious to readers that this code really means:

my $obj = Class->new( foo => 'bar' );

... but if the parser has already seen a subroutine declaration for Class in this particular namespace (in certain circumstances, I believe it's possible to fool the parser with a specific way of declaring new), the heuristic may choose the wrong option.

The indirect pragma is an effective way to root out this ambiguity from your code. There are some seven cases in which you can run afoul of this bareword heuristic; you can tell who's a real Perl 5 guru by asking him or her to list at least five. See Matt S. Trout's Indirect but Still Fatal for a fuller discussion of the issues here.

A very clever, very careful static parser could still replicate the complex heuristic in the Perl 5 parser and parse code appropriately.

The Real Problem

What's the source of the ambiguous parsing issue?

If Perl 5 requires predeclaration of all parse modifications, and if they're declarative, why is writing a static parser which can parse all Perl 5 programs in the same way that the Perl 5 parser does impossible?

Perl 5's compilation phase parses Perl 5 code into a parse tree. The execution phase traverses this tree. They're two separate stages...

... except that BEGIN and implicit BEGIN blocks can run arbitrary code during the compilation phase.

That is, you can write code that will not compile successfully:

sub foo ($$)
{
    return "<@_>";
}

say foo 1, 2, 3;

... and insert a BEGIN block to change the way Perl 5 will parse subsequent code:

sub foo ($$)
{
    return "<@_>";
}

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub foo { return "<@_>" }'
}

say foo 1, 2, 3;

Even though the eval code wouldn't normally run until after the compilation stage has finished and the execution stage has begun, the BEGIN block causes the Perl 5 parser to execute its contents immediately after its parsing has finished -- before parsing the very next line.

This is a contrived example -- why would anyone write this code? -- but you should start to see the problem.

Even though this syntax is declarative to the Perl 5 parser, a static analysis tool would have to analyze the semantic meaning of code within the BEGIN block to determine if it has any effects. That means peeking into quoted strings used as arguments to eval.

Even if a static parser could do that, it'd be easy to confuse it again:

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub' . ' foo ' . '{ return "<@_>" }'
}

Even though Perl 5 can coalesce those three concatenations to a single constant string during its optimization phase, the parser must get more complex to detect this case.

You can see where this is going.

Ultimately, completely accurate static parsing of Perl 5 programs is impossible in certain cases because I could write:

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub' . ' foo ' . '{ return "<@_>" }'
        if rand( 1 ) > 0.5;
}

The static analysis tool can't even tell if this program would parse successfully. About half the time, Perl 5 will report that it has a syntax error. Half of the time it will succeed.

Thus there exist pathological cases written by curious (and potentially malicious) people such as myself where a static Perl 5 parser cannot predict what the actual Perl 5 parser will do. If you're morbidly curious, consider changing the prototype of a symbol already parsed through raw typeglob access.
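To see both outcomes without waiting on rand, here's a sketch which replaces the coin flip with an explicit flag and compiles each variant separately through string eval. The compiles function, the Sandbox package names, and the template approach are all scaffolding I've assumed for illustration; the program inside the template is the same shape as the examples above.

```perl
use strict;
use warnings;

# The program under test, with placeholders for a package name and for the
# condition which decides whether the BEGIN block redefines foo.
my $template = <<'END_SOURCE';
package %s;

sub foo ($$)
{
    return "<@_>";
}

BEGIN
{
    no warnings qw( prototype redefine );
    eval 'sub foo { return "<@_>" }' if %d;
}

my $result = foo 1, 2, 3;
1;
END_SOURCE

# Returns 1 if the generated program compiles and runs, 0 otherwise.
sub compiles
{
    my ( $package, $redefine ) = @_;
    my $source = sprintf $template, $package, $redefine;

    return eval($source) ? 1 : 0;
}

my $kept     = compiles( 'Sandbox::Prototyped', 0 );
my $replaced = compiles( 'Sandbox::Redefined',  1 );

print "prototype kept: $kept\n";
print "prototype replaced: $replaced\n";
```

The same source text compiles or fails depending solely on a value available only when the BEGIN block runs. A static tool reading the template can't know which of the two it's looking at.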

The Implications

Don't panic. Static analysis tools can't parse all valid Perl 5 programs, but they can parse many Perl 5 programs. While the Perl 5 parser is complex with many special cases, it's possible to write a clever tool such as PPI which is good enough for almost all code. Besides that, the language-bending manipulations that could trip up a static analysis tool are mostly declarative and mostly decidable, unless you're doing something beyond weird.

In most cases this doesn't matter. That example code is completely unreliable. There are ways to disambiguate potentially ambiguous code -- and good Perl 5 hackers recommend using those approaches anyway. Most languages without a very strict homoiconic representation have parsing ambiguities (and if you allow arbitrary code execution in templates or reader macros, even homoiconicity can't always save you).

Besides that, once you allow loading language extensions from languages with access to raw memory, all bets are off.

With a bit of work, I'm sure I could come up with ways to confuse static Ruby and Python parsers without resorting to C. Implicit variable declarations seem like one perilous approach....

Why The Oyster Farming Book Market Crashed

By chromatic on August 12, 2009 2:32 PM | 8 Comments

One of the suggested objectives in Gabor Szabo's Measurable objectives for the Perl ecosystem is "Increase the number of Perl book sales". I like most of the objectives in Gabor's post, but I must caution against taking the numbers presented too seriously. At best, they're incomplete. At worst, they're completely misleading.

Gabor refers to State of the Computer Book Market - Mid-Year 2009, written by Mike Hendrickson. Mike's company publishes similar analyses a few times a year, based on sales data from Nielsen Bookscan. (For more on the culture of Bookscan rankings in the publishing world, see Why writers never reveal how many books their buddies have sold.)

This data sounds wonderful and the pretty graphs and charts give you the impression that you're getting useful information. Yet this is only a picture of the market. As Mike writes later in the piece:

Many publishers report that more than 50% of their revenue is achieved as direct sales, and those numbers do not get reported into Bookscan. Sales at traditional college bookstores are typically not reported into Bookscan as well. Again this is US Retail Sales data recorded at the point of sale to a consumer.

These numbers reflect less than half of revenue. Throw out half of sales by dollar and hope that the results are stochastic.

Yet there's a deeper flaw behind these numbers.

How Book Sales Work

If you plot the sales curve of multiple books, you'll notice that they tend to follow the ubiquitous power law. A book sells as many copies in its first three months as it will the rest of the first year. A book sells half as many copies in the second year as it did in the first. This model is so accurate that the publishing industry calls titles "frontlist" titles if they're in their first year of publishing and "backlist" titles if they're not.

While a few titles have strong backlist sales, they're rare. They're the Bibles and Harry Potters and How to Win Friends and Influence Peoples. The publishing industry's san greal is to find a new strong backlist bestseller.

They tend to exhibit strong frontlist behavior as well.

The retailer's point of view is different. Limited shelf space means that new books still in their three-six-twelve month short snout sales levels often get priority over older books in the long tail sales levels. If you're going to sell 3000 copies of a book in the first three months and 1000 copies in the next three years, stock up early.

This is especially true in technical book publishing, where I have trouble giving away Python 2.3 and Oracle 7 and Red Hat Linux 6 books. Publishing dates are expiration dates: best by a year after the copyright date.

Why does this matter? It's a simple matter of economics: people won't buy books you don't publish.

The Freshness Factor

2005 was a good year for Perl book sales. Why? Four strong Perl books came out in 2005. The Perl book sales numbers for that year reflected the short snout of Perl book sales.

Four years later, is PBP selling as many copies? Is the Perl Testing book? Is HOP? Is APP 2e?

Those are rhetorical questions. You already know the answer. You can even answer that question for the Camel 3e. A book published in 2000 may still be useful nine years later, but Camel 3e predates almost every part of the Perl Renaissance. Besides that, the 250k or 300k units already sold have reached a fair amount of the Perl 5 programming market.

Compare that with the Ruby book market in 2006, where you couldn't leave an editorial board meeting without an assignment to publish a new Ruby or Rails book. Initial sales numbers looked great; the growth in that market segment was huge!

Did any Ruby book sell 250k copies, though? That number's missing from the year-by-year analysis.

Look at this year's numbers. Objective-C is huge! It's 1999 all over again! Except that, yet again, the comparison is to an emerging market segment without analysis of historical trends.

The Missing Data

The biggest piece of data obviously missing from these State of the Computer Book Market entries is historical context. Six months or a year of appositional data comparing different market segment maturities is misleading, at best. Should you go learn Objective-C just because Bookscan reported more Objective-C titles sold than SQL?

No -- but to be fair, Mike doesn't suggest this directly.

Other missing data is more subtle, and perhaps more meaningful. Where's the breakdown of frontlist/backlist for these sales figures? More than nine out of ten books follow the power law I described earlier. If the Objective-C books have all come out in the past year, they're in their short snout period. Of course they're selling more units now than books in the long tail period.

How many total units does the market represent? If the number of books sold in 2009 is half the number sold in 2008, it's difficult to compare the performance of books against each other year-over-year. There are too many other factors to consider. (You can still get interesting information, but you can't compare technologies against each other in meaningful appositive ways.)

How many books are in each category? Title efficiency (average number of unit sales per title and standard deviation) can tell other interesting stories. Is one language category hit driven (iPhone Programming, Ruby on Rails)? Are there niche subjects intended as modest sales targets and not bestsellers? Is every book a moderate success, with no breakout quintessential must-have tome? Is there a gold rush of publishing with 40 new titles produced in a year and each of them selling a dismal 1000 copies apiece?

How many new books are in a market segment this year compared to last year? This is the biggest question that matters to Perl books, especially with regard to Gabor's suggestion. Again, this should be obvious: no one can buy Camel 4e right now.

A Completely Hypothetical Fictional Example I Made Up Completely From Whole Cloth

If that didn't convince you, consider a short fable about oyster farming.

Suppose you own a publishing company. Suppose you discover a new topic area: oyster farming. No one's published on this topic before, but hundreds of thousands of people are doing it. There's a lot of institutional knowledge, but there's a ripe opportunity for documenting best practices and nuanced techniques -- especially given that you have found the person who invented modern oyster farming and convinced him to write a book about it.

You publish the book. It takes off. Its short snout is wide. (My metaphor is awkward.) You've discovered a new market segment; you've invented a new market segment. Life is grand.

You branch out. You publish More Oyster Farming and Learn to Farm Oysters and Pteriidae, Reefs, Bivalves, and Mollusks. You even write a cookbook for Oysters.

Then a catastrophic triploid spawning accident removes the long-beloved MSX resistance in most commercial oyster farms, ruining the market for a year -- maybe longer -- and in a panic you cancel all of your upcoming frontlist titles.

A few other publishers publish one- or two-off titles in the market segment. They sell a few copies. You had a corner on the market though. You were the publishing world's China of oyster farming. Over the next four years, you look at your sales numbers and congratulate yourself for getting out of the oyster farming publishing market segment when you did, because no one's buying oyster farming books anymore.

After all, publishing one frontlist title per year is obviously a sign you take the oyster farming market seriously and want to see it continue.

A One-Line Slurp in Perl 5

By chromatic on August 10, 2009 1:48 PM | 4 Comments

In a comment on A Modern Perl Success Story for the Internet CLI, Darian Patrick asked for an explanation of my file slurping one liner:

my $contents = do { local $/ = <$fh> }

While File::Slurp exists on the CPAN to encapsulate these sorts of tricks and Perl 6 provides the slurp method on filehandles, this one-liner is in my toolbox because it's very simple (for me) to remember and type.

Perl Slurp Explained

As you may remember from perldoc perlvar, $/ is Perl 5's input record separator variable. Its contents are a literal string used to identify the end of a record when using readline on a filehandle. Its default is the platform-default newline combination—\n, whatever that translates to on your platform and filesystem.

To read a file with different record separators—perhaps double newlines—set $/ to a different value.

To read the whole file in one go, set $/ to the undefined value.

It's always good practice in Perl 5 to localize all changes to the magic global variables, especially in the smallest possible scope. This helps prevent confusing action at a distance. (I appreciate that Perl 6 moves these globals to attributes on filehandles.)

That explains how this code works:

my $contents = do { local $/; <$fh> };

(I may have first encountered this idiom in perldoc perlsub.)
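The same localization pattern handles custom record separators. In this sketch an in-memory filehandle stands in for a real file, and the sample text is made up; the separator is the double newline mentioned earlier.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file; the data is made up.
my $text = "alpha\nbeta\n\ngamma\ndelta\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# In list context, readline returns every record, split on the localized $/.
my @records = do { local $/ = "\n\n"; <$fh> };
close $fh;

my $count = @records;
print "read $count records\n";    # read 2 records
```

Each record keeps its trailing separator, just as lines keep their newlines until you chomp them.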

How does my code work?

Idiomatic Perl Slurp

The localization has to occur before the assignment, for obvious reasons. As it happens, it occurs before the readline. As the readline uses the contents of $/ to determine how much to read, it sees the undefined value, reads the entire file, and assigns its contents to $/. Even though leaving the do block immediately restores the previous contents of $/, the assignment expression occurred in scalar context, thanks to the assignment of the block's result to $contents. An assignment expression evaluates to the value assigned: the slurped contents of the file.

As you may have determined already, the do block both limits the scope of the localization and makes all of the file manipulation into a single expression suitable for assignment to $contents.
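You can check the scoping claim for yourself. This sketch uses the two-statement form of the idiom and an in-memory filehandle with made-up contents, then confirms that $/ holds its default value again once the do block has ended.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file; the data is made up.
my $text = "first line\nsecond line\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# $/ is undefined only inside the block, so readline returns everything.
my $contents = do { local $/; <$fh> };
close $fh;

# Outside the block, the previous separator is back in place.
my $restored = ( defined $/ && $/ eq "\n" ) ? 'yes' : 'no';

print "slurped ", length($contents), " characters; separator restored: $restored\n";
```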

Perl Slurp and Clarity

Using slurp from a module is likely clearer. As well, localizing and copying file contents may be somewhat inefficient. In the case of my Identi.ca poster, files will rarely be larger than 140 characters, the program is short-lived, it blocks on user input, and it immediately produces and consumes network traffic, so this is unlikely to be a bottleneck in any sense.

I skimmed the relevant documentation and couldn't find a guarantee that the order of operation of localization and readline will remain as it is; I tried a few experiments with B::Concise to confirm my suspicions, but ran afoul of the Perl 5 optimizer. It may be safer to use two expressions in the block:

my $contents = do { local $/; <$fh> }

Even still, a silly little idiom like this demonstrates some of the interesting features of Perl 5.

To learn more about the idioms of Perl 5, and to learn how to use the language effectively, see Modern Perl: The Book.

The Whipupitude-Neophyte Conundrum

By chromatic on August 6, 2009 1:26 PM

If you put a million monkeys at a million keyboards, one of them will eventually write a Java program. The rest of them will write Perl programs.

— anonymous Internet wag

I was a laser printer guru in 1998. I spent most of my time talking to customers about printing and imaging. I can probably find a blocked paper sensor on an HP LaserJet 5si ten years after my last experience troubleshooting the paper jam message. (I can probably close my eyes and walk someone through it from memory.)

I delivered two programs in that job. One was for the customer service department. Every time someone from level one support had to ask a question of level two support, level two support needed to record the relevant product line. These became daily, weekly, and monthly reports used to bill the subdepartment for each product for the service calls.

I was teaching myself Java at the time. I wrote a small AWT GUI application (in those days, that's what you could get with the Java GNU/Linux port on Red Hat 5.3) which ran on a spare PC we set up at the level two support desk. It logged all statistics to a flat file which another program -- I believe a Tcl program written by a co-worker -- summarized into the reports. This program took a couple of weeks to write. The program was still running when I last walked through that floor of that building in early 2000.

My second program was for the networking support group. In those days, internal support groups often managed the external-facing web site for their products. They wanted a program which could identify when a page or pages in the web site changed and email an opt-in list of customers.

I thought about the problem and wrote a small Bourne shell script (running on HP-UX 10.x, I believe) to do the job. The code must have been between ten and fifteen lines long. As it happens, the networking group used IIS and an early Java application server plugin to manage their web site, so they wanted a Java program which ran on Windows. They asked me to port my proof of concept to Java instead.

I never finished that program. I switched to a system administrator role and discovered that the Perl I'd dabbled with late in 1998 was a much better fit for the system administration and web development I needed to do.

The afternoon of my first day on the new job, the administrator I was replacing turned around and asked me if I knew Perl. I'd dabbled. I'd written a bit. I could read a program and have some sense of what was happening. I said "Some, yeah."

He was working on a program and kept getting a weird syntax error. I looked over his shoulder and said "That elseif should be an elsif."

As I went through the files he left, I found several printouts of Perl programs he'd downloaded from the Internet, changed the copyrights, and tweaked very slightly.

If the department I worked in back then still exists (and it might) and if someone still remembers me from then (doubtful), I wouldn't be surprised to learn that some of the code I wrote there still exists. (One evening I saved my group a million dollars in penalties by fixing a misbehaving FTP server right before a deadline to deliver new firmware images to a vendor. They gave me a plaque.)

I've toyed with Java some since then, but I haven't delivered any significant software in Java. I hear it's a passable language with some great libraries and effective tools.

I've spent the intervening years understanding software development, helping create a testing culture, figuring out what's wrong with object orientation and how to fix it, and learning how virtual machines work to help invent the future (or at least drag mainstream dynamic languages kicking and screaming into the '90s).

I delivered that first Java program because I was stubborn and had a lot of spare time over that fortnight.

I've delivered almost every Perl program since then because of Perl's whipupitude factor. That's how the administrator I replaced could download a likely program, tweak a few things, and pass off a working solution as his own. That's how a motivated, mumble-something fledgling music performance major like me could write a proof of concept mail notification service in an hour with the Bourne shell and port it to Perl 5 in an afternoon and still not be sure which Java SMTP package actually worked back in 1998.

I have no illusion that I wrote idiomatic Perl 5 code then. (The Perl Renaissance had yet to begin.) I doubt I'd want to maintain that code now. I wouldn't suggest it as a good example for novice developers to emulate.

Yet it did the job. The sound of that code is the sound of me and my users and co-workers not banging their heads against the wall. It's the sound of a 30 second task repeated ten times a day -- and its concomitant context switch -- vanishing.

Anonymous Internet critics can shake their heads and wag their beards and tear their outer garments and throw dust in the air as they wail and weep that ugly Perl code exists and persists.

I'm all for improving Perl code. I believe in tests and refactoring and idioms and reuse and clarity and good naming and succinctness and encapsulation and proper factorization and favoring simplicity over cleverness. I believe in teaching novices and guiding them and encouraging them to develop good habits and consider maintenance and understand what happens and how things work and to think not only about what they want to do but why and how to do so sustainably.

I want better defaults and clearer error messages and fewer rough patches where they can trip and stumble and skin their knees and elbows.

I want that very much.

Yet the whipupitude that lets novices -- and there are some six and a half billion people on the planet who've never even thought they could write a useful computer program -- do useful things even if sometimes they do it the ugly way is a positive aspect of Perl. It's a positive aspect of many tools.

By all means let's continue making Perl easier to understand and gentler to learn and more encouraging to write good code the right way. That's important. We're all novices sometimes. Let's help them get their work done and then help them get their jobs done well. Programming well is difficult; why not encourage them with small victories? Why make the barrier to accomplishment artificially high?

I don't understand that argument.

Then again, some people probably think that Pablo Neruda never existed after hearing me speak Spanish.

A Modern Perl Success Story for the Internet CLI

By chromatic on August 5, 2009 4:44 PM | 8 Comments

I've never believed the argument that web applications will completely replace native client-side applications. The last time someone said that, I held up my Android phone and said "This is a pervasive Internet device. I use several applications regularly. The only native web application I use is a web browser."

Maybe I'm an old dinosaur well before my time, but I'm happy writing these posts in Vim (with copious macros and customizations) through It's All Text!. I don't mind using Git from the command line, and I think ctags is still one of the most useful IDE inventions.

That doesn't mean I reject the billion or so devices that make up the Internet, nor all of the information they contain, nor the services they provide.

It means I can be clever about using them.

A Modern Perl Identi.ca/Twitter CLI

Twitter may be a morass of uselessness which all too often proves the existence of sewers upstream of the collective stream of consciousness, but in between all of the "lollers" and "RT RT RT lollers cat falling down video!!" sometimes it's a good way to pass on lightweight, transient information to people who might care.

I've used Identi.ca for a year now. As part of marketing my business, I set up a couple of accounts to which I publish notifications about the fiction and non-fiction publishing work we do. If I edit a chapter in an upcoming novel we've announced, I can send out a short notice to anyone following that novel's progress. The same goes for a book.

It's easy, it's quick, it's non-intrusive, and people really do seem to follow this information. (Wouldn't you like to know how far away is the next book from your favorite author? Still waiting for that seventh Zelazny Amber story....)

Setting all of this up for multiple books and multiple accounts could be tricky. I could log into Identi.ca manually every time I want to post an update (maybe a couple of times a day). I have to manage several user names and passwords. What could be a 15-second update when it's fresh in my mind could take several times that long, and I wouldn't do it.

That's why I have a decent shell and modern Perl.

I wrote a proof of concept in ten minutes using Net::Identica; all hail Perl and the author:

#!/usr/bin/perl

use Modern::Perl;
use Net::Identica;
use File::Temp 'tempfile';

sub main
{
    my ($username, $pass, $message) = @ARGV;

    do
    {
        $message = get_message( $message );
    } until ($message && length( $message ) <= 140);

    post( $username, $pass, $message );
}

sub get_message
{
    my $message = shift;

    my ($fh, $filename) = tempfile();
    print {$fh} $message if $message;

    system( $ENV{EDITOR}, $filename );

    seek $fh, 0, 0;
    exit if eof $fh;

    # undefine the input record separator so the read slurps the whole file
    return do { local $/; <$fh> };
}

sub post
{
    my ($username, $pass, $message) = @_;

    say "[$message]";

    my $ni      = Net::Identica->new(
        legacy   => 0,
        username => $username,
        password => $pass,
    );

    $ni->update( $message );
}

main();

The most interesting line is the call to system(). The program takes three arguments: a username, a password, and an optional message to post. The system() call launches the program named in $EDITOR (for me, Vim) on a temporary file containing the message. If there's no message, I can write one. If there is a message, I can edit it. When I save the file and quit the editor, the program immediately posts the message to Identi.ca with the given username and password.

It's easy to create bash shell aliases to customize the behavior further. I have a file called ~/.alias, which can contain, for example:

alias gigdent="perl ~/bin/dentit gigapolis password"
alias mpdent="perl ~/bin/dentit modern_perl password"

For every new project I start, I can create a new account and shell alias. Then I only have to remember the name of the alias to write a status update.

50 generously-spaced lines of modern Perl code and a bit of shell glue later, I can get my work done with little fuss and even less cognitive overhead from task-switching. Maybe it's not jaw-droppingly amazing like rounded corners on a Web 2.0 startup written by skinny-jeans hipsters with deliberately messy hair and flipflops, but it helps get my work done and it took longer to write this post than to write the code.

I'll call that a Perl success story any day.

Reduce Complexity, Prevent Bugs

By chromatic on August 3, 2009 4:54 PM | 1 Comment

I spend a lot of time thinking about how to prevent bugs in Parrot. My first contribution to the project was a patch in late 2001 to make an essential Perl 5 program used in the build compatible with Perl 5.004. (My, how times change.) I've spent countless hours in the intervening seven and a half years helping the project become correct, complete, viable, and competitive.

Many of my opinions about the maintainability and sustainability of software projects come from experiences with Parrot (sometimes to the chagrin of people who don't know the other projects I can't talk about which have similar characteristics).

Fiddly Bits of Parrot Not Always Easy to Write Correctly

Parrot makes pervasive use of a data structure called a PMC, a PolyMorphic Container (or Parrot Magic Cookie). A PMC represents anything that's not a primitive value -- anything more complex than an integer, a floating point value, or a string. In Perl 5 terms, a PMC resembles an SV. Don't take that line of thinking too far; PMCs take the good parts of SVs and avoid the scary, complex parts of SVs.

Because Parrot hasn't quite managed to get rid of C entirely yet (see the Lorito plan for more about that), we have several dozen core PMCs written in C.

A PMC has several well-defined behaviors which form the vtable interface. These are common operations that any PMC should be able to perform: get a scalar value, set an integer value, access a nested PMC, invoke the PMC as a callable function. Not every PMC performs every defined vtable function, but unimplemented functions produce Parrot exceptions rather than interpreter crashes.
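
To make the vtable idea concrete, here's a minimal sketch in C. It is not Parrot's real API: every name in it (ToyPMC, ToyVTable, the two vtable slots, the error field) is hypothetical, and a plain error string stands in for a Parrot exception. It shows only the shape of the idea, where each PMC points to a table of function pointers and a slot the type doesn't implement points at a default that reports an error instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not Parrot's real API. */
typedef struct ToyPMC ToyPMC;

/* A tiny vtable: one function pointer per common operation. */
typedef struct ToyVTable {
    long (*get_integer)(ToyPMC *self);
    void (*set_integer)(ToyPMC *self, long value);
} ToyVTable;

struct ToyPMC {
    const ToyVTable *vtable;
    long             value;
    const char      *error;   /* stands in for a thrown exception */
};

/* Defaults for unimplemented slots: record an error, don't crash. */
static long default_get_integer(ToyPMC *self)
{
    assert(self != NULL);
    self->error = "get_integer() not implemented";
    return 0;
}

static void default_set_integer(ToyPMC *self, long value)
{
    assert(self != NULL);
    (void)value;
    self->error = "set_integer() not implemented";
}

/* An integer-like type implements both slots. */
static long integer_get_integer(ToyPMC *self)
{
    return self->value;
}

static void integer_set_integer(ToyPMC *self, long value)
{
    self->value = value;
}

static const ToyVTable integer_vtable = {
    integer_get_integer, integer_set_integer
};

/* A type that implements neither slot falls back to the defaults. */
static const ToyVTable opaque_vtable = {
    default_get_integer, default_set_integer
};
```

Callers always go through the table (pmc->vtable->get_integer(pmc)), so they never need to know which concrete type they hold.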

Additionally, most PMCs have attributes. Think of a PMC as a class, with instances of that PMC as objects and PMC attributes as instance attributes and vtable functions as instance methods, and you have a conceptual understanding which works at a high level.

Because of our current use of C as the PMC declaration language, PMCs need to understand their memory management characteristics. In other words, if your PMC has two INTVAL attributes and one PMC attribute, the PMC initializer (like a constructor, in OO terms) needs to allocate enough memory to store these three attributes. Similarly, the PMC's garbage collection mark vtable function needs to be able to mark any PMC stored as an attribute as live. The PMC's destroy vtable function (a destructor, of sorts), needs to release the memory allocated for attribute storage back to the system.
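
Here's roughly what that bookkeeping looks like when every PMC author writes it by hand. Treat it as a sketch under assumptions, not as code from Parrot: ToyPMC, Pair_attributes, the live flag, and the three pair_* functions are all hypothetical stand-ins, and the "mark" step only sets a flag rather than talking to a real garbage collector.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of this is Parrot's real API. */
typedef long INTVAL;

typedef struct ToyPMC {
    void *data;   /* attribute storage, owned by this PMC */
    int   live;   /* set during a pretend GC mark phase */
} ToyPMC;

/* Two INTVAL attributes and one PMC attribute. */
typedef struct Pair_attributes {
    INTVAL  first;
    INTVAL  second;
    ToyPMC *other;
} Pair_attributes;

/* init: allocate zeroed storage for the attributes.
   Returns 1 on success, 0 if allocation failed. */
static int pair_init(ToyPMC *self)
{
    assert(self != NULL);
    self->data = calloc(1, sizeof (Pair_attributes));
    return self->data != NULL;
}

/* mark: flag this PMC and the PMC it holds as live.
   Forget this and the collector may reclaim 'other' while
   the pair still points to it. */
static void pair_mark(ToyPMC *self)
{
    Pair_attributes *attrs = self->data;

    self->live = 1;
    if (attrs && attrs->other)
        attrs->other->live = 1;
}

/* destroy: release the attribute storage.
   Forget this and every destroyed pair leaks memory. */
static void pair_destroy(ToyPMC *self)
{
    free(self->data);
    self->data = NULL;
}
```

Each of the three functions is short, but all three have to agree with the attribute struct, and nothing checks that they do.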

("Don't you have a garbage collector?" you may ask. That's a good question. We could let the garbage collector manage the lifecycle of all of these pieces of memory, but they're already attached to GCable elements, so we don't need to mark or sweep or trace them. The malloc/free memory model works here well enough, even though we use memory pools to avoid the costs of malloc/free.)

Why Fiddly Bits are a Problem

Thus to write a PMC without any garbage collection errors, without any memory leaks, and without any random corruption waiting to happen, you had to remember several steps. In practice, people writing their own custom PMCs copied and pasted behavior from an existing PMC, then refactored it until it did what they wanted.

I spent a couple of weeks reading every line of every core PMC in Parrot. I fixed a lot of bugs. I can spot GC and memory bugs in patches. The problem is that I don't scale and you can't get the experience I have without going through all of the bugs I've gone through -- and if I never read your patch, you may still have that bug.

Properly Encapsulated Complexity

Julian Albo and Andrew Whitworth (and several other Parrot developers) made an improvement recently in this area.

PMCs with attributes need to declare them. We use a mini-language built around C to define PMCs. For example, the PMC which represents an object in Parrot (the Object PMC) has two attributes: a PMC which represents the class of the object and a PMC which contains the instance variables of the object. The code looks like:

pmclass Object need_ext {
    ATTR PMC *_class;
    ATTR PMC *attrib_store;

    /* vtable entries go here */

    /* PMC methods go here */
}

The PMC to C conversion step creates a C struct to hold this PMC attribute data:

/* Object PMC's underlying struct. */
typedef struct Parrot_Object_attributes {
    PMC * _class;
    PMC * attrib_store;
} Parrot_Object_attributes;

Thus at Parrot's compilation time -- when we compile the Parrot virtual machine -- we know how much memory we need to store the attributes of each PMC. We know which PMCs have attributes (not all do). We know which PMCs need to mark their attributes specially (this one does, as its attributes are GCables and not primitive values).

Julian's idea was to store the size of the attribute structure in the PMC structure. When allocating a new PMC, the PMC initialization code also allocates memory to contain the PMC's attributes and attaches it. Thus all of the bookkeeping code in PMC init vtable functions can go away. When destroying an unused PMC, the PMC destruction code can free this memory. Thus all of the bookkeeping code in PMC destroy vtable functions can go away.

We can even get rid of a special PMC flag value which meant something to the garbage collector but was fiddly to get right, because people often forgot to enable it.
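
A rough sketch of that idea in C follows. It assumes a much simpler world than Parrot's: ToyType, ToyPMC, toy_pmc_new, and toy_pmc_free are hypothetical names, the attribute size lives in a per-type record rather than wherever Parrot actually keeps it, and plain calloc/free stand in for Parrot's memory pools. What it does show is how one generic allocator and one generic destructor can replace the per-PMC bookkeeping.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; a sketch of the idea, not Parrot's code. */
typedef struct ToyType {
    const char *name;
    size_t      attr_size;   /* size of the attribute struct; 0 if none */
} ToyType;

typedef struct ToyPMC {
    const ToyType *type;
    void          *data;     /* attribute storage, or NULL */
} ToyPMC;

/* The one place that allocates attribute storage for every type. */
static ToyPMC *toy_pmc_new(const ToyType *type)
{
    ToyPMC *pmc;

    assert(type != NULL);
    pmc = malloc(sizeof *pmc);
    if (!pmc)
        return NULL;

    pmc->type = type;
    pmc->data = NULL;

    if (type->attr_size > 0) {
        pmc->data = calloc(1, type->attr_size);
        if (!pmc->data) {
            free(pmc);
            return NULL;
        }
    }

    return pmc;
}

/* The one place that frees attribute storage for every type. */
static void toy_pmc_free(ToyPMC *pmc)
{
    if (!pmc)
        return;
    free(pmc->data);
    free(pmc);
}

/* The size comes straight from the generated attribute struct. */
typedef struct Object_attributes {
    ToyPMC *_class;
    ToyPMC *attrib_store;
} Object_attributes;

static const ToyType object_type = { "Object", sizeof (Object_attributes) };
static const ToyType undef_type  = { "Undef",  0 };
```

With this shape, a PMC author declares attributes and nothing else; a type with no attributes costs no extra allocation.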

This new code is easy to prove correct. It either works or it doesn't. It's one codepath to examine and patch, not dozens of core PMCs and countless other PMCs existing now or in the future. This reduces the amount of code people need to write and reduces the amount of code existing in our system.

We've moved the internal bookkeeping mechanism out of the user-visible portions of Parrot. If you want to hack on the GC, feel free -- but most people shouldn't have to. They shouldn't even have to know how it does what it does. (Knowing won't hurt, but they shouldn't have to know the mechanisms by which it does what it does.)

That's one principle of software development I always encourage. Encapsulate confusing or dangerous or difficult code behind a nice interface. Now you don't have to worry about doing the wrong thing because you don't know how to write code which does the wrong thing. If you don't write any code at all, Parrot will do the right thing for you.

Yes, we changed the way you define PMCs -- but tell me that this isn't an improvement for everyone. That's a principle of modern Perl I want to encourage.
