May 2012 Archives

When You Can't Misuse the Immutable

By chromatic on May 31, 2012 2:02 PM | 4 Comments

There are two well-known ways to make things impossible to misuse. One way is to remove all but the simplest and safest features. You could call this the Haskell Report approach, where the original language avoided most security holes by forbidding user input.

(Yes, I know that wasn't the point of the Haskell Report. Fortunately, most of the really good Haskell fans have good senses of humor.)

The other approach is less popular. It allows people to perform complex tasks—even dangerous ones—but it prefers to make them safe by making the safe things easy to do correctly. You could call this the CUC approach, where a language designer would never say "Hey, let's provide a builtin where you can load arbitrary data from the Internet and automatically populate values into your single global variable namespace! Including $is_admin and $is_verified and $discount!"

It's easy to believe that the next decade of professional programming will deal with the tension between functional programming and just-get-something-done-now programming. (Objects have won, at least enough. Genericity has won, at least enough. JSON and REST have won, at least enough to wash the taste of SOAP out of our mouths.) Popular belief may claim that functional programming is interesting because it can provide cheap parallelism with the flip of a switch. Popular belief is wrong, but popular belief is popular.

The interesting parts of functional programming aren't the silly utopianisms that "Learn Haskell and your programs will automagically scale to 256 cores and beyond!" but the way we can steal useful features of functional languages as patterns for real programs that aren't solely pure, functional, typeful, lazy, and referentially transparent.

To call back to the Haskell Report reference, we handle user input all the time, but we can be safer and saner about it.

Many of your programs will benefit from pervasive laziness. (Most of mine do.) Many of your programs will benefit from reducing side effects, especially global side effects. It's not just Haskell; you're probably already using dependency injection too.

Another great feature of functional programming is immutability, where things can't change after you've created them.

This is in truth a great feature of any API. I reviewed some code from a slightly less experienced developer a couple of hours ago. He's competent with Perl, though his code has a C accent that comes from electrical engineering school. Good C programmers handle errors. Robust C code spends probably at least 30% of its lines of code on error handling. This code was no exception.

The bad news is that C has few possibilities for abstraction. In particular, C doesn't make it easy to create immutable data structures. The good news is that Perl does.

The biggest suggestion I made was to consider a pattern I've discovered while using Moose pervasively: perform all of your verification and validation at the point of object construction and resist the temptation to make mutable objects. This pattern obviously needs a catchier name.

The benefit to your API is that if you can create an object, the object is always in a consistent state. Any validation errors get reported from the point of the code that attempts to create the object. This keeps the scope of the error reporting to the point at which it matters the most (at least to the constructor API). This reduces the possibility that someone will manipulate the object and make it invalid.

You have to go to great lengths to misuse such an object (or such an API) because you have to bypass its interface altogether, if the interface forbids mutation.

You get to remove a lot of error checking code because within the API you can assume that the object is always in a valid state. (Anyone who's violated that contract gets to keep the broken pieces.)
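Here's a minimal sketch of the pattern in plain Perl, without Moose. The Discount class, its attributes, and its validation rules are all hypothetical, invented only to show the shape: every check lives in the constructor, the accessors refuse to act as setters, and the one method that does real work assumes a valid object.

```perl
use strict;
use warnings;

{
    package Discount;

    use Carp 'croak';

    # All validation happens here; callers never see a half-valid object.
    sub new
    {
        my ($class, %args) = @_;

        my $code    = $args{code};
        my $percent = $args{percent};

        croak 'code is required'
            unless defined $code && length $code;
        croak 'percent must be a number between 0 and 100'
            unless defined $percent
                && $percent =~ /\A\d+(?:\.\d+)?\z/
                && $percent <= 100;

        # Store copies of the validated values, not the caller's hash.
        return bless { code => $code, percent => $percent }, $class;
    }

    # Read-only accessors: calling one with an argument is an error.
    sub code    { croak 'code is read-only'    if @_ > 1; $_[0]{code} }
    sub percent { croak 'percent is read-only' if @_ > 1; $_[0]{percent} }

    # No validity checks here: construction already guaranteed them.
    sub apply_to
    {
        my ($self, $price) = @_;
        return $price * ( 1 - $self->percent / 100 );
    }
}

my $discount = Discount->new( code => 'SPRING', percent => 25 );
print $discount->apply_to( 80 ), "\n";    # prints 60

my $bad = eval { Discount->new( code => 'OOPS', percent => 250 ) };
print "rejected: $@" unless $bad;
```

A hash-based object like this can still be poked at from outside, which is the "bypass its interface altogether" case; the interface itself offers no way to put the object into an invalid state.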

This pattern would work just as well in C or Python or PHP or Java. Nothing is specific to Haskell or Perl that makes this pattern impossible in other languages. It works better in Haskell than in Perl because the language supports it, much as it works better in Perl than in C because the language somewhat supports it and great libraries make it easier.

The technique works, if you're diligent about figuring out what changes from what stays the same at each level of your design. That may mean some classes or data structures need to break into smaller pieces. That's not always easy, nor always simple. It's often not obvious.

The benefit, though, is what the functional programmers have been saying all along. Reducing mutable state can help us write better code and less code and code that's easier to use correctly and much, much more difficult to use incorrectly.

(One of the reasons I use Perl 5 so much is that Moose makes it really easy to make my classes immutable. Any object system worth using should do the same.)

Why I Use Perl: Reliability

By chromatic on May 29, 2012 10:27 AM | 4 Comments

Perl 5.16 came out last week. That's the tenth stable release of Perl 5 in the past two and a half years and either the third or fourth major release in the same period. (I consider 5.10.1 a major release. Others do not. It matters little.)

I've already switched my main development and deployment environment to 5.16 and will switch over the remaining two user-facing servers to 5.16 in the next couple of weeks. In fact, switching my main development server to 5.16 took eight minutes last Monday, the day of the release.

While so much of the shiny-chasing buzz in the micro-ISV startup build-it-and-they-will-fund world seems to chase Clojure and Node.js and app-in-a-page tricks, Perl 5 continues to be my workhorse. It's my language of choice for prototyping and deployment, in part because it's so reliable.

I maintain three modest-sized applications these days. They need to work and continue to work. (They're in the scale of a few tens of thousands of SLOC, counting tests.) They do rely on many libraries from the CPAN, including DBIx::Class, Catalyst, and Moose, along with other specialized tools. A quick find for .pm files in my site_perl/ directory shows 4126 individual modules. My rough guess is that this represents between 250 and 300 direct dependencies and their supporting materials.

When the Perl 5.16 release candidate came out, I immediately installed it with perlbrew (see Testing Perl release candidates with perlbrew). I set about installing all of the CPAN dependencies for each of those three projects.

The entire process took four wall hours, though most of that time the process chugged along without my intervention. I found three problems. One is a dependency with a weird version number that'll be a problem on every version of Perl, and it has an upstream bug report. One was a temporary problem with a fresh release of a module already fixed in source control and unrelated to 5.16. The other was a known and temporary incompatibility with 5.16 with an existing patch.

Out of 250 or 300 dependencies for my tens of thousands of lines of production-ready code, on the day of release of the release candidate, only one incompatibility existed, and it already had a patch. (The patch worked.) Think about that. Before Perl 5.16 had its official release, enough of the CPAN had been tested and proven working with 5.16-to-be that I could have run user-facing applications on a release candidate with no disruption in service.

(As of this writing, the CPAN hosts almost 25,000 distributions and over 107,000 modules.)

All of my tests for each of my applications passed, too.

We in the Perl world talk a lot about the testing culture of Perl. It took a lot of work to demonstrate the value of automated testing way back in the late '90s and early 2000s, but that word has spread beyond the language itself and into the ecosystem.

I have the utmost confidence that I can switch the two remaining servers over to Perl 5.16 at my leisure in the next couple of weeks with minimal effort—not only because I've tested my applications with Perl 5.16 to my confidence but because the entire Perl ecosystem has been tested extensively across platforms.

For me, one of the strongest reasons to use Perl is that upgrading is no longer a question of "What will break?" but "How boring is this process?"

(If you haven't used Perl 5 in a while, Modern Perl: the Book shows how to get the most out of the language.)

Keep Your ResultSets DRY

By chromatic on May 25, 2012 12:29 PM

For your amusement, here's my fuzzy-headed bug of the day. I wrote this code.

This application processes documents. A document may refer to one or more images outside of the document. Where this happens, we fetch all of those external images until we find one which can serve as a useful thumbnail for our document summary.

One stage of the document analysis pipeline identifies potential images. Another stage fetches and analyzes those images. We use DBIx::Class as our storage layer. We set up a request for valid potential images with:

my $valid_image_rs    = $image_rs->search({
    state                => { '!=' => INVALID },
    'fetchable.validity' => { '>=' => 0 },
}, { join => 'fetchable' });

We have, of course, an image table. We also have a fetchable table. The latter represents anything which has a URI and which we may make a network connection to fetch. This allows us to store metadata such as fetch time, etag, and last-modified header values along with a validity score. Every time we fail to fetch an image for reasons of network connectivity (on either side), we remove a point from this score. After n unsuccessful fetches, we don't try anymore without manual intervention.

This is all well and good.

Documents have a state attribute, just like images do. A document in the WAITFORIMAGE state will receive no processing until we've processed all of its images and either found an appropriate thumbnail for the summary or invalidated all of the candidates. At that point, the document should move on to another stage.

For various uninteresting technical reasons, the code to move a document from the WAITFORIMAGE stage to the WRITESUMMARY stage needs to query the database about the related images. That code looks like:

my $images_rs = $self->entry_images->search_related( 'image' );

my $image_count = $images_rs->search({ state => { '!=' => 'INVALID' } })
                            ->count;

You probably see the bug right there, in that there's no check of image fetchable validity.

I've said more than a few times that I really like how DBIC lets you enhance your ResultSet classes with custom methods. I use this quite a bit to store searches I find myself using over and over. (This isn't an MVC application, but this technique is still a great way to separate model concerns from controller actions, such that your controller can call methods on the model and leave the searching code to the model classes entirely.)

If I'd moved this ad hoc query to the appropriate ResultSet class, I wouldn't have had this problem. (I wouldn't have spent a day debugging it.)

DBIC gives you the power to encapsulate interesting queries behind method calls. Take advantage of this. Anything more complicated than a very simple search probably warrants this treatment.
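Something like the following is what I have in mind. Treat it as a sketch under assumptions: the MyApp::Schema namespace and the valid_candidates name are made up for illustration, the state is compared against the string 'INVALID' rather than a constant, and the class assumes your schema picks up ResultSet classes through load_namespaces.

```perl
package MyApp::Schema::ResultSet::Image;

use strict;
use warnings;
use parent 'DBIx::Class::ResultSet';

# One definition of "an image still worth considering", shared by every caller.
sub valid_candidates
{
    my $self = shift;

    # Qualify the column so the search also works when chained from a
    # related resultset, where the image table may not be aliased as "me".
    my $me = $self->current_source_alias;

    return $self->search(
        {
            "$me.state"          => { '!=' => 'INVALID' },
            'fetchable.validity' => { '>=' => 0 },
        },
        { join => 'fetchable' },
    );
}

1;
```

With that in place, the pipeline stage can ask for $image_rs->valid_candidates and the document code can ask for $self->entry_images->search_related( 'image' )->valid_candidates->count, and the two can't drift apart.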

Toward Coding Without Conditionals

By chromatic on May 24, 2012 2:16 PM | 2 Comments

My favorite refactoring is Replace Conditional with Polymorphism. If I were to make a list of everything I've learned the hard way in programming, this refactoring would be at the top of the list near "Make it easy to test always" and "Make it difficult to misuse".

I'm sympathetic to the view in interface design that the fewer choices, the better. I believe that push for simplicity comes from the desire to force the design to consider the right choices. If you get the default behaviors right, if you choose the right things to make easy and obvious, and if you don't forbid people from extending your code or product to do special things, you have achieved something lovely and worthwhile.

When we design things, we have to find a balance of flexibility (people should be able to use this for purposes we can't imagine right now) and simplicity (we should emphasize the few things we do well and right). That goes for writing code, from designing APIs to each line of code we write.

APIs are user interfaces for programmers, sure. The individual lines of code we write—within our functions and our methods and our loop bodies—are APIs for understanding the problems we're trying to solve.

When I write code, I want to write boring code. I want to write code so boring that it's easy to overlook the code because the intent of the code is so staggeringly obvious. Boring code is straightforward. Boring code doesn't have a lot of corner cases to get hung up on. Boring code you can glance over and see what it's doing.

Boring code gets out of your way.

A lot of the code I've written recently looks like this:

sub method
{
    my ($self, $collection) = @_;

    while (my $item = $collection->next)
    {
        next if $self->do_this( $item );
        $self->do_that( $item );
    }
}

Sometimes there are preconditions:

sub method
{
    my ($self, $collection) = @_;

    return unless $collection->has_some_attribute;

    while (my $item = $collection->next)
    {
        next if $self->do_this( $item );
        $self->do_that( $item );
    }
}

Sometimes there are postconditions:

sub method
{
    my ($self, $collection) = @_;

    my @failures;

    while (my $item = $collection->next)
    {
        next if $self->do_this( $item );
        $self->do_that( $item );
        push @failures, $item;
    }

    return unless @failures;
    return \@failures;
}

I like this form less than the previous two; there's too much structural code for my taste. It's still pretty boring though.

One of my favorite features of generic or polymorphic programming is that you can push the "What should I do if this condition does not apply?" question into methods or pattern matches (in the Common Lisp or ML or Haskell sense, not regular expressions). Dealing with an empty list in Haskell is easy:

sumList []     = 0
sumList (x:xs) = x + sumList xs

... and the corresponding Perl is:

sub sum_list
{
    return 0 unless @_;

    my ($first, @last) = @_;
    return $first + sum_list( @last );
}

Math's a bad example for OO polymorphism, but I would rather write:

for my $item (@collection)
{
    $item->save;
}

... than:

for my $item (@collection)
{
    $item->save if $item->is_dirty;
}

... and let save() decide what to do. When I phrase it like that, it's obvious. It's the object's responsibility to decide what to do. Maybe that's why non-guard-clause conditionals bother me more and more: they're a sign that I need to move around responsibilities more to cluster them where they really belong. As silly as it seems to say that a little word like if or unless is a sign something's wrong, those little words can be a sign that the current code under consideration is doing too much. It needs to be simpler.
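As a rough sketch of what that looks like on the other side of the method call, here's a hypothetical Article class (the class, its attributes, and the save counter standing in for real storage are all invented for the example). The only conditional left is a guard clause inside save() itself.

```perl
use strict;
use warnings;

{
    package Article;

    sub new
    {
        my ($class, %args) = @_;
        return bless { title => $args{title}, dirty => 1, saves => 0 }, $class;
    }

    sub title
    {
        my $self = shift;
        return $self->{title} unless @_;

        # Changing the title is what makes the object need saving.
        $self->{title} = shift;
        $self->{dirty} = 1;
        return $self;
    }

    # The guard clause lives here, once, not in every loop that saves.
    sub save
    {
        my $self = shift;
        return $self unless $self->{dirty};

        $self->{saves}++;    # stand-in for writing to real storage
        $self->{dirty} = 0;
        return $self;
    }

    sub saves { $_[0]{saves} }
}

my @collection = map { Article->new( title => $_ ) } qw( first second );

$_->save for @collection;            # both are new, so both get written
$collection[0]->title( 'First' );    # only this one changes
$_->save for @collection;            # the caller still doesn't have to ask

printf "%s: %d\n", $_->title, $_->saves for @collection;
```

The first article reports two saves and the second reports one, and nothing outside the class ever had to know what "dirty" means.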

(There's a friction, of course, to breaking classes into smaller domains of responsibility, but that's the subject of a different post.)

The Current Sub in Perl 5.16

By chromatic on May 21, 2012 11:10 AM | 9 Comments

Recursion is one of those computer science ideas that seems so difficult to understand before you get it, and then seems so easy after you understand it that you can't remember not understanding it.

An anonymous function is one of those computer science ideas that makes no sense until it makes sense, and then you understand the beauty (and the horror) of the Von Neumann architecture. (Long story short: computers don't care about the names of things. People do.)

When you combine those two ideas, you get things like the Y combinator, in which you jump through hoops to create an anonymous function which recurses to itself while remaining anonymous.

If you're the kind of pragmatic joker that Perl tends to attract, you might say to yourself "Wow, those Scheme hackers are crazy. Just do it in one step of Perl 5 like this:"

my $func;
$func = sub
{
    my $factor = shift;
    return $factor > 1
        ? $func->( $factor - 1 ) * $factor
        : 1;
};

... which preserves the anonymity of the function and allows recursion at the cost of a memory leak. (Weakrefs fix this, but are you going to remember to do that?)
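For the record, the weakref version looks something like this sketch. The do block and the $strong variable are my own arrangement for illustration; the essential part is that the variable the closure captures holds only a weak reference, so the function no longer keeps itself alive.

```perl
use strict;
use warnings;
use Scalar::Util 'weaken';

my $factorial = do
{
    my $func;
    my $strong = $func = sub
    {
        my $factor = shift;
        return $factor > 1
            ? $func->( $factor - 1 ) * $factor
            : 1;
    };

    # The closure captures $func. Weakening it breaks the reference cycle,
    # leaving $strong (and then $factorial) as the only strong reference.
    weaken( $func );
    $strong;
};

print $factorial->( 5 ), "\n";    # prints 120
```

It works, but that's a lot of ceremony to remember every time you want a recursive closure.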

(At this point, Python fans will say "Just stick a name on that function. There's no reason not to name everything." They may have a point; I went through a phase when I was a kid where I used my parents' embossing label maker whenever I could. The problem with that is that the adhesive leaves a sticky residue on everything. No amount of enforced whitespace can fix that.)

Perl 5.16 fixes this situation.

If you use 5.016; or use feature 'current_sub';, you enable the __SUB__ builtin, which is a reference to the function currently executing.
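Here's the factorial from earlier rewritten that way, as a small sketch. It needs Perl 5.16 or newer; there's no outer variable for the function to capture, so there's no cycle to leak.

```perl
use strict;
use warnings;
use feature 'current_sub';

my $factorial = sub
{
    my $factor = shift;

    # __SUB__ refers to this anonymous function itself.
    return $factor > 1
        ? __SUB__->( $factor - 1 ) * $factor
        : 1;
};

print $factorial->( 5 ), "\n";    # prints 120
```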

Suppose you have a tree structure representing articles, something like:

my $tree =
[
    { state => 'READ',   id  => 1 },
    { state => 'UNREAD', id  => 2 },
    [
        { state => 'READ', id  => 3 },
        { state => 'READ', id  => 4 },
        [
            { state => 'UNREAD', id  => 5 },
            { state => 'READ',   id  => 6 },
        ],
    ],
];

Suppose you want to traverse this structure looking for articles in the UNREAD state. In Perl 5.16, you might write something like:

sub traverse
{
    my ($root, $comparator) = @_;

    my @items;
    for my $element (@$root)
    {
        if (ref $element eq 'ARRAY')
        {
            push @items, __SUB__->( $element, $comparator );
        }
        else
        {
            push @items, $element if $comparator->( $element );
        }
    }

    return @items;
}

... as a starting point. Sure, there's no need in the code as it exists now to use __SUB__ instead of the hard-coded traverse(), but consider two pragmatic arguments for anonymous recursive functions. First, the two-step code to make a lexical recursive function (first, declare the lexical variable; in the next statement, use that lexical binding to recurse) has the danger of a memory leak and it looks odd. The Y combinator code in Perl is worse. Clarity suffers.
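To make that concrete, here's the same traversal as an anonymous function, run against the example tree. The $traverse variable name and the comparator are mine; the body follows the named version, and again it needs Perl 5.16 or newer.

```perl
use strict;
use warnings;
use feature 'current_sub';

my $tree =
[
    { state => 'READ',   id => 1 },
    { state => 'UNREAD', id => 2 },
    [
        { state => 'READ', id => 3 },
        { state => 'READ', id => 4 },
        [
            { state => 'UNREAD', id => 5 },
            { state => 'READ',   id => 6 },
        ],
    ],
];

# No name in the symbol table and no self-referencing lexical to leak.
my $traverse = sub
{
    my ($root, $comparator) = @_;

    my @items;
    for my $element (@$root)
    {
        if (ref $element eq 'ARRAY')
        {
            push @items, __SUB__->( $element, $comparator );
        }
        elsif ($comparator->( $element ))
        {
            push @items, $element;
        }
    }

    return @items;
};

my @unread = $traverse->( $tree, sub { $_[0]{state} eq 'UNREAD' } );
print join( ', ', map { $_->{id} } @unread ), "\n";    # prints 2, 5
```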

Second, the drawback of using a named function is that that function is available in the namespace. If you write mostly object-oriented code (as I do) with helper functions (as I do), the fact that Perl borrowed Python's misfeature of exposing everything in a namespace as a method means that any named function may be exploitable as a method.

Anonymous functions ameliorate that danger.
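A small demonstration of the difference, using a hypothetical Report class with two made-up helpers: one named, one a lexical code reference. The named helper answers to method calls whether you meant it to or not; the lexical one exists only inside the enclosing block.

```perl
use strict;
use warnings;

{
    package Report;

    sub new { return bless {}, shift }

    # A named helper lives in the package's symbol table...
    sub _escape { my $text = shift; $text =~ s/&/&amp;/g; return $text }

    # ...while a lexical code reference is visible only inside this block.
    my $trim = sub { my $text = shift; $text =~ s/\A\s+|\s+\z//g; return $text };

    sub render
    {
        my ($self, $text) = @_;
        return _escape( $trim->( $text ) );
    }
}

my $report = Report->new;

print $report->render( '  R&D  ' ), "\n";
print 'named helper is a method: ',   ( $report->can( '_escape' ) ? 'yes' : 'no' ), "\n";
print 'lexical helper is a method: ', ( $report->can( 'trim' )    ? 'yes' : 'no' ), "\n";
```

The leading underscore on _escape is only a convention; method dispatch will still find it.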

Anonymous recursive functions (where recursion is appropriate) are a tool wielded with wisdom.

Anonymous recursive functions with trivial syntactic support (the implementation in the core is next to trivial) are a boon to Perl's pragmatism.

(If you're using an older version of Perl 5, see Sub::Current for a workalike.)

Programming Breaks Things

By chromatic on May 18, 2012 9:38 AM

Computer scientist Edsger Dijkstra famously said "It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."

I disagree, in principle and in practice. (I disagree so strongly that I work on a project to teach programming to children.)

I believe it's almost impossible to teach programming to someone who hasn't experienced what we USians call "Geometry". That's mathematics: not the specific behavior of triangles and angles and their relationships, but the hard work and creativity and even beauty of following a set of logical rules to a desirable conclusion. People who can do that can program effectively. People who can't do that will struggle.

Before you can solve a big problem, you have to break it.

One of my work projects is a document categorization system. I've written before that it uses a pipeline processing model, where a document moves through the pipeline in various named stages. One stage might be "NEW", while another might be "EXTRACT METADATA". As the system runs, documents make their way through the pipeline in various stages and eventually enter a search index and an archive intended for users.

Documents come from various places, and it's possible for identical (or near-identical) documents to enter the system at various times. I've long had an exact title match filter as a first approach to remove duplicates, but it's never filtered out enough duplicates. (Some documents are essentially press releases barely edited and republished by multiple news organizations. These documents are almost never interesting and are frustrating in their sameness within the archive, but in the system they go regardless.)

We've talked about several approaches to finding duplicate and near-duplicate articles, with everything from heuristics to identify title similarity to maintaining multiple latent semantic indexes for each unique category of documents. I dragged my feet on the latter because documents expire after 90 days, and managing an n-dimensional corpus search space where one of those dimensions is also time was more work than I wanted.

Wednesday I realized that a naïve approach could give really good results while being easy to code and, more importantly, very quick to run. I coded and deployed it yesterday, and tuned it and deployed an improved version as I was writing this very paragraph. I added a new processing stage which makes a word histogram for every new document entering the system and compares those histograms to existing articles. If they're similar enough, the new article gets invalidated before it enters the search index or undergoes any further processing.

It's silly, but it works. It's 108 lines of code, per sloccount.

I realized something while writing it: programming is breaking things.

Long years of programming experience have taught me that most problems are too big. Most functions are too long. Most methods are too long. Most entities in the system do too much.

If you read much novice code, you see long functions (if you see functions at all) with deeply nested conditionals and mutable state mutating all over the place, because a variable at the top of the program gets used all throughout the entire program. You see a mess, and you see a maintenance burden, and you see someone flailing to control something that's grown way out of hand.

(You see this in part because people trying to learn how to program are also learning the syntax and semantics of a programming language, and until you know the vocabulary rules, you're going to have trouble understanding nuance of meaning and metaphor and idioms.)

I had no trouble writing this code in the small because I know the tools Perl provides for me: hashes and arrays and methods.

I had little trouble writing this code, because I understand the pattern of fetching a document at a time from an iterator and processing it to get a histogram and putting that histogram in an array for later processing.

I had an easy time testing this code because I know how to write testable code: each of my methods has a well-defined input and a well-defined output and I can test only at those boundaries to see what happens.

Even though you don't know the details of this system, if you're a decent programmer, you can probably write an outline of how the code works just from how I've described it already:

Get a collection of all active extant documents
Iterate over them
Fetch a histogram of each
Get a collection of all new documents
Iterate over them
Fetch a histogram of each
Compare each to every document in the histogram array
Invalidate the document if it matches any histogram too closely
Add the document's histogram to the array

You can probably guess the names of my methods. If you're not exactly right, you're close.
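Here's one way that outline could turn into code. This is a sketch, not the production code: the function names are guesses, documents are plain strings instead of database rows, and the comparison is cosine similarity with a threshold, which is an assumption (all the outline says is "too closely").

```perl
use strict;
use warnings;

# Count how often each lowercased word appears in a document.
sub histogram
{
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /(\w+)/g;
    return \%count;
}

# Length of a histogram treated as a vector of word counts.
sub magnitude
{
    my ($histogram) = @_;
    my $sum = 0;
    $sum += $_ ** 2 for values %$histogram;
    return sqrt $sum;
}

# Cosine similarity: near 1 for the same word distribution, 0 for no shared words.
sub similarity
{
    my ($left, $right) = @_;

    my $dot = 0;
    $dot += $left->{$_} * ( $right->{$_} // 0 ) for keys %$left;

    my $scale = magnitude( $left ) * magnitude( $right );
    return $scale ? $dot / $scale : 0;
}

# Return the new documents which aren't too similar to anything seen so far.
# Assumption: only documents which survive add their histograms to the array.
sub filter_duplicates
{
    my ($existing, $new, $threshold) = @_;

    my @seen = map { histogram( $_ ) } @$existing;
    my @kept;

    DOCUMENT:
    for my $document (@$new)
    {
        my $histogram = histogram( $document );

        for my $other (@seen)
        {
            next DOCUMENT if similarity( $histogram, $other ) >= $threshold;
        }

        push @seen, $histogram;
        push @kept, $document;
    }

    return @kept;
}

my @kept = filter_duplicates(
    [ 'Acme Corp announces record quarterly earnings' ],
    [
        'ACME Corp announces record quarterly earnings!',
        'Local bakery wins regional bread contest',
    ],
    0.9,
);

print "$_\n" for @kept;    # prints only the bakery story
```

Every function takes plain data and returns plain data, which is what makes each one easy to test at its boundaries.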

This is the discipline and experience that sets a good programmer apart from a novice. Sure, a novice (or an undisciplined programmer) could write twice as much code to do the same thing and get it working. Maybe he or she could write four times as much code. (I don't pretend that my factoring of this code is the rightest way to do it, but I do know that it passes multiple tests.)

That's my writing in the small. My writing in the large is even more interesting.

Each stage in the pipeline is its own self-contained class. I call them app classes. Every app class conforms to an interface and gets run by a runner. Every app connects to a defined logger and performs its own registration and reporting.

Every app has a method to fetch its basic resultset (every app is part of a processing pipeline; obviously it's going to iterate over documents in a certain state). Every app has method hooks to fire before this iteration and after it. Every app has a process() method which performs the iteration.
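A stripped-down sketch of that shape follows. The package names, the hook names, and the array reference standing in for a real resultset are all hypothetical; the real classes use roles and DBIC, which this leaves out to stay self-contained.

```perl
use strict;
use warnings;

{
    package App::Base;

    sub new { my ($class, %args) = @_; return bless { %args }, $class }

    # Hooks: subclasses override these as needed; the defaults do nothing.
    sub before_process { }
    sub after_process  { }

    # Subclasses must say which documents they handle and what to do with each.
    sub resultset    { die ref( shift ) . ' must implement resultset()' }
    sub process_item { die ref( shift ) . ' must implement process_item()' }

    # The iteration itself is written once, here.
    sub process
    {
        my $self  = shift;
        my $items = $self->resultset;

        $self->before_process;
        $self->process_item( $_ ) for @$items;
        $self->after_process;

        return scalar @$items;
    }
}

{
    package App::ExtractMetadata;

    use parent -norequire, 'App::Base';

    # An array reference stands in for a query by document state.
    sub resultset { $_[0]{documents} }

    sub process_item
    {
        my ($self, $document) = @_;
        $document->{state} = 'EXTRACTED';
    }

    sub after_process { $_[0]{finished} = 1 }
}

my @documents = ( { id => 1, state => 'NEW' }, { id => 2, state => 'NEW' } );
my $app       = App::ExtractMetadata->new( documents => \@documents );

print $app->process, " documents processed\n";    # prints 2 documents processed
```

A runner only needs to know about process(); everything specific to a stage lives in the methods the stage overrides.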

I've extracted and formalized the thirteen app classes over the past several months. They started as a series of individual scripts. Then they had a common base class. Now they share code with roles, take configuration out of a common configuration file, and register themselves when loaded as plugins. They can run separately (great for testing) or all together (as is normal).

I knew from the start that I was writing suboptimal code I'd eventually have to change, but that's because I didn't know enough about the problem yet. I'd discover that as the project went on. I'd gain more insight as I saw what kinds of documents we'd have to handle (and how very strange some of them are compared to what we expected).

The original concept of refactoring always reminds me of math. We rearrange things to make them clearer, to prepare us to do other work, or harder work, or at least further work. It's not change for change's sake, and it's not change to add or remove or modify behavior. It's nothing more or less than changing the design of things without changing their behavior.

It's the same skill, from writing functions of the right name and size to putting modules in the right places with the right contents. It's about breaking big things into smaller things. It's about breaking things into the right things.

(Dijkstra is right that BASIC affords few abstraction possibilities to break programs into effective and distinct components, but for novice programmers the experience of turning what seems like a simple task into the steps required to accomplish it is an important experience. That's also one reason why Modern Perl: The Book uses small test programs to demonstrate language features: working in small steps is too important to ignore.)

Time Will Tell

By chromatic on May 16, 2012 3:28 PM

The May 2012 Dr. Dobb's interview with Ward Cunningham has an interesting quote about Ward's notion of technical debt:

I was really devoted to finding great code, especially when objects were new. Objects gave us an extra dimension beyond functional decomposition. And the question was, "Are these the right objects or not?" And the answer was, "Time will tell."

I work off and on with a handful of great programmers in the Portland area. Several years ago, James Shore and Dave Woldrich created CardMeeting, an agile remote collaboration tool. Jim and Dave are both very good programmers. For this project, they decided to forgo their usual test-driven development and just write code so as to deliver a working prototype on a very strict deadline.

Jim took to calling that experience "leveraged technical debt". My estimate (not having read the code, but having tested a lot of code written without testing in mind) is that it takes at least as long to write tests for untested code as it took to write the code and much longer the more time has passed between writing the code and writing the tests.

Jim, Dave, and I have all worked on small, software-driven businesses doing things we've never seen anyone else do before. We've all had to deal with the risk of building lots of code that may or may not solve the problems of real customers with real money. When I say write the wrong code first, I don't mean "deliberately do things you know won't work" or "paint yourself into a corner" or even "use the fact you don't know everything you're doing as an excuse to play with completely new technologies you don't know how to use". (Not that the latter is a bad thing, but if you decide to do that, do so only after you've considered the risks and the rewards.)

Last night, we had a short conversation with John Wilger, another PDXer. He works with a successful and relatively young startup with a huge software component. I don't want to put words in his mouth, but it sounds like their software is, colloquially, a mess. Their developer team is trying to get to the point of slapping hands whenever someone needs to make a change and starts by copying and pasting code.

Four years after founding (and two years after discovering its cash cow business), the company was worth at least $3 billion.

It's irresponsible to derive meaningful statistics from a single data point, but we can say this: the technical debt of their codebase didn't entirely prevent the company from achieving its current measure of success. (You can also say that the liberal application of candy-flavored magical unicorn shavings of Ruby and Rails didn't prevent people from making an unholy mess.)

Time will tell if changing the development culture and refactoring the code and paying down all of the technical debt will help the company adapt and take advantage of new opportunities.

Time will tell if the codebase collapses under its own weight.

Time will tell if a competitor (and several exist!) will prove more agile and nimble because it has much better flexibility thanks, in part, to better code.

The whole situation reminds me of Facebook's HipHop virtual machine, where it's apparently cheaper and easier and faster and less risky to hire lots of developers to create and maintain a compatibility layer for the existing code than to rewrite existing code in a better language, or in a better fashion, or to improve it meaningfully.

I'm not suggesting that the only way to build a big business from nothing is to write bad code. I'm not suggesting that scaling to billions in revenue is the goal of all software-driven businesses. I'm not suggesting that you have to choose between test-driven development and business success.

In an ideal world, I can write the right software the first time. I can have sufficient test coverage to have complete confidence in the behavior of the code. I can deliver a feature which gets me paying customers in an afternoon without having to rewrite other parts of the code or taking shortcuts I know that I'll have to clean up when I get a spare weekend afternoon.

For a profession where some of us call ourselves "engineers", we certainly spend a lot of time discussing practical concerns as if the risks and rewards and limitations of the real world did not apply. (I wonder if the academic/practical divide between computer science and software development has some relationship to this.)

In the real world, I have to remind myself of two things every day. When I'm working on proof of concept code, proving my concept workable is more important than solidifying my code into well-tested and well-designed software. When I'm working on code I intend to keep, doing things as right as possible now will help me modify it to get it more right in the future.

None of this guarantees success. All of this benefits from the hard-won experiences I have from doing things the wrong way—and occasionally getting it very right. (In the real world, I spent part of the day finding and deploying a shim to turn SVG into VML for Internet Explorer 8 and earlier.)

Maybe Jim and Dave could have thrown out a couple of features and spent more time writing tests for the most valuable parts of their application. Maybe I'm wasting my time optimizing SQL queries for a search feature no one will ever use. Maybe John's company waited too long to untangle the admin and the user sides of their application.

If we're honest with ourselves, the best answer we can give is that time will tell. May we pay attention when it does.

Separating Presentation from Content in Templates

By chromatic on May 14, 2012 11:47 AM | 2 Comments

A couple of comments on Simple Attribute-Based Template Exporting have asked for an example. I'll show off more of this code in my YAPC::NA 2012 and Open Source Bridge 2012 talk about how to write the wrong code (along with a handful of other techniques).

(I assume some knowledge of Template Toolkit (besides far too many books about finance, accounting, and investing, the Template Toolkit book is always within reach these days); I've set up a wrapper template which provides the standard look and feel of my application and I include/process other templates liberally. If you understand that much, you'll be able to follow along.)

One of the interesting templates in the system displays a list of chapters of a book in progress. A cron job rebuilds a static page from this template once a day. The template looks much like this:

[% USE Bootstrap -%]
[%- canonical_url = 'http://sitename.example.com/book/' _ link -%]

[%- add_og_properties({
    'fb:admins'      => '436500086365356',
    'og:title'       => title _ ' | sitename.example.com',
    'og:type'        => 'article',
    'og:image'       => 'http://static.sitename.example.com/images/logo.png',
    'og:url'         => canonical_url,
    'og:description' => text.chunk(300).0,
    'og:site_name'   => 'Sitename: site tag line',
   })
-%]
[%- add_meta(
    'pagetitle'     => title _ ' | sitename.example.com',
    'feed_url'      => 'http://static.sitename.example.com/book/atom.xml',
    'canonical_url' => canonical_url
) -%]

[% article_text = BLOCK -%]
<article>
<h2>[% title | html %]</h2>
<p>Published: <time datetime="[% date %]">[% nice_date %]</time></p>
[% text %]
</article>

<ul class="pager">
[%- IF prev -%]
    <li><a href="[% prev.link %].html">← [% prev.title | html %]</a></li>
[%- END -%]
    <li><a href="/onehourinvestor">index</a></li>
[%- IF next -%]
    <li><a href="[% next.link %].html">[% next.title | html %] →</a></li>
[%- END -%]
</ul>

[% INCLUDE 'components/social_links.tt', title => title %]
[%- END -%]

[%- row(
    maincontent( article_text ),
    sidebar(
        sideblock( process( 'components/cached/book_latest_chapters.tt' ) ),
        sideblock( process( 'components/cached/book_drafts.tt'          ) )
    )
) -%]

The row( ... ) call at the end is the most important part; it puts all of the content produced or assembled by this template in the HTML structure the site needs. That is to say, everything on the site needs to fit into something I call a row. A row can contain multiple elements, such as maincontent and a sidebar, or fullcontent by itself with no sidebar. A sidebar can contain multiple sideblocks.

(You can ignore the other functions; they put metadata in the right places to pass to wrapper templates.)

Within my template plugin (called Bootstrap), each of these elements is a simple Perl function which takes one or more arguments and interpolates it into some HTML:

sub row :Export
{
    return <<END_HTML;
<div class="row">
    @_
</div>
END_HTML
}

sub sidebar :Export
{
    return <<END_HTML;
<div class="span4">
    @_
</div>
END_HTML
}

(I initially tried to write these functions as templates within Template Toolkit itself, but there comes a point at which you want a real language. That point came very early for me.)
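
To see how functions like these compose outside of the template system, here's a minimal standalone sketch. It drops the :Export attribute so that it runs without the plugin, and the span8 and sideblock class names are guesses for illustration, not necessarily the markup the site uses.

```perl
use strict;
use warnings;

# Each helper interpolates its arguments into a wrapper element.
# Interpolating @_ in a string joins the arguments with a space.
# Only "row" and "span4" come from the post; the other classes are hypothetical.
sub row         { return qq{<div class="row">\n@_\n</div>\n} }
sub maincontent { return qq{<div class="span8">\n@_\n</div>\n} }
sub sidebar     { return qq{<div class="span4">\n@_\n</div>\n} }
sub sideblock   { return qq{<div class="sideblock">\n@_\n</div>\n} }

# The same nesting the template uses: one row holding the main
# content and a sidebar with two blocks.
my $html = row(
    maincontent( '<article>...</article>' ),
    sidebar(
        sideblock( 'latest chapters' ),
        sideblock( 'drafts' ),
    ),
);

print $html;
```

Because every helper returns a plain string, the nesting in the template is ordinary function composition; changing a class name in one function changes it everywhere.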

I lose no love over the varname = BLOCK pattern necessary to populate variables to pass to these plugin functions, but it works for now. In some of my templates—usually those with lots of text I might end up changing later—I extract that text into a separate template under components/content/ to make it easy to edit. (This idea came up during a client project where the client wanted to edit the legal clickthrough arrangement after users create accounts. I didn't want lawyers or anyone to have the ability to mess up the templating language, so I said "Edit this single file as plain HTML and you'll be fine." It worked great.)

While my programmer brain says "This is ugly, and you're a horrible person for committing this hack upon the world—you're calling Perl from your template system to generate HTML you're stuffing into a template and that puts your presentation elements in Perl code, you awful human being!", it keeps the presentation code in a single place where I can update it infrequently (being that I don't change the layout of the site dramatically) without having to change the divs and classes of multiple templates.

I'm not arguing that this technique as expressed here is right. It's probably not optimal; there may be easier approaches to achieve the same effects.

I am saying that this currently works very well for me. I'm not typing the same HTML over and over and over again, and I can tweak it much more easily than I did before when I was refining the look and feel. In fact, I've even forgotten the exact details of the layout, from the HTML/CSS point of view, and now think only in terms of rows, maincontent, and sidebars.

Working abstractions are very nice.

Simple Attribute-Based Template Exporting

By chromatic on May 11, 2012 1:29 PM | 3 Comments

If you're like me and your design skills are sufficient to modify something decent to look nice but insufficient to create something from first principles, you can do a lot worse than to play with Twitter Bootstrap for your next web site.

I've used it successfully for a few projects and it's been great.

It's a lot better now that I've written my own silly little Template Toolkit plugin to reduce the need for writing lots of repetitive HTML in my templates. (It's like Haml but less ugly and more Perlish and easier to extend.)

Writing a TT2 plugin is relatively easy. Of course I do it the wrong way; when you initialize your plugin, you have the ability to manipulate TT2's stash. This is the data structure representing the variables in scope in your templates. Where a well-behaved template should use object methods to perform its operations, my code stuffs function references in the stash. Here's the relevant code:

sub new
{
    my ($class, $context, @params) = @_;

    $class->add_functions( $context );

    return $class->SUPER::new( $context, @params );
}

sub add_functions
{
    my ($class, $context) = @_;
    my $stash             = $context->stash;

    while (my ($name, $ref) = each %exports)
    {
        $stash->set( $name, $ref );
    }

    $stash->set( process => sub { $context->process( @_ ) } );
}

I'll fix this eventually, but the process of making this work was interesting.

In my first attempt (see Write the Wrong Code First for the justification), I'd write the function I needed, like row(), which creates a new Bootstrap row or maincontent() which creates the main content area of the page. Then I'd add that function to the %exports hash and everything would work.

After the sixth function, keeping that list up to date was tedious. Then I kept forgetting it. After all, any time you have to update the same data in two places, you're doing something wrong.

Now the code looks more like:

sub row :Export
{
    return <<END_HTML;
<div class="row">
    @_
</div>
END_HTML
}

... with a single code attribute marking those functions which I want to stuff into the template stash. I've used Attribute::Handlers before, but I always end up reading the manual and playing with things to get them to work correctly. (Something about the way you have to write another package and inherit from it to get your attributes to work correctly always confuses me.)

My second attempt lasted no longer than ten minutes. I switched to Attribute::Lexical. This is almost as trivial to use as to explain:

use Attribute::Lexical 'CODE:Export' => \&export_code;

Whenever any function has the :Export attribute, Perl will call my export_code() function:

my %exports;

sub export_code
{
    my $referent = shift;
    my $name     = Sub::Identify::sub_name( $referent );

    return unless $name;
    $exports{$name} = $referent;
}

The first argument to this function is a reference to the exported function. I use Sub::Identify to get the name of the function reference. (That wouldn't work for anonymous functions, but I can control that here.) Then I store the name of the function and the function reference in a hash.

It took as long to write as it does to explain.
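
Attribute::Lexical and Sub::Identify aren't core modules. For comparison, here's a rough sketch of the same registration idea using only what ships with Perl: the MODIFY_CODE_ATTRIBUTES hook described in the attributes documentation, plus the B module to look up names. This is a swapped-in technique, not a description of how Attribute::Lexical works, and the exports() helper is a hypothetical name. It also resolves names lazily, when something asks for the table, rather than at declaration time.

```perl
use strict;
use warnings;
use B ();

my @exported;    # code references marked :Export, collected during compilation

# Perl calls this hook for any sub in this package declared with
# attributes. It must return the attributes it does not recognize.
sub MODIFY_CODE_ATTRIBUTES
{
    my ($package, $referent, @attributes) = @_;

    push @exported, $referent if grep { $_ eq 'Export' } @attributes;

    return grep { $_ ne 'Export' } @attributes;
}

# Build the name => code reference table a plugin could copy into the stash.
sub exports
{
    my %exports;

    for my $referent (@exported)
    {
        my $name = B::svref_2object( $referent )->GV->NAME;
        $exports{$name} = $referent;
    }

    return %exports;
}

sub row :Export
{
    return qq{<div class="row">@_</div>};
}

# No attribute, so this never reaches the table.
sub helper { return 'not exported' }

my %exports = exports();
print join( ', ', sort keys %exports ), "\n";
```

The tradeoff is the usual one: the core-only version has no dependencies, but the hook applies to the whole package and you have to write the bookkeeping yourself.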

A lot of people dislike the use of attributes. Used poorly, they create weird couplings and plenty of action at a distance. Attribute::Handlers can be confusing.

I like to think that I'm using attributes well here (even if I'm abusing TT2 more than a little), and that they've simplified my code so that I can avoid repeating myself and performing manual busywork that I'm likely to forget. Even better, the code to use them isn't magical at all: it's all hidden behind the pleasant interfaces of Attribute::Lexical and Sub::Identify.

Write the Wrong Code First

By chromatic on May 9, 2012 11:37 AM | 6 Comments

I rewrite code often.

If I were a better programmer, designer, or businessman, I would rewrite my code much less frequently—but I get things wrong about as often as I get them right. Even with years of practical experience, software's still too difficult to predict with any degree of accuracy.

As a case in point, I've been revising some financial software in the past week. In reviewing the calculations, I found a way to simplify them dramatically. Even better, these simplifications allow me to simplify the interface and user experience.

That means rewriting a lot of code. That means throwing out code and revising the storage model and making a lot of changes.

I'm fortunate to have a good test suite that runs in 15 to 20 seconds and lets me know that everything I most need to work continues to work. That's a lot of confidence. People who like to talk about test-driven development and refactoring tout this as one of the benefits of well-tested software: you can refactor with confidence.

I'm not refactoring. I'm throwing away parts of this application and adding others. I'm changing how it behaves. Even though my test suite helps, that's not refactoring.

As part of this project, I've added an SVG graph to a class of web pages. I started by creating the SVG in Inkscape. Then I exported it as plain SVG. Then I made a template for that SVG to include from the page template.

That was still the example SVG with sample data, still the proof of concept.

I then extracted one piece of hard-coded data and made it a templated value. One. Everything still worked. Then I extracted the second piece of data and so on.

It's one step at a time. It's one change at a time. I'm using Git, so I could even commit after every single change, no matter that it's a few characters or even merely changing the color of a bar in the graph. I can work in steps as small and discrete as possible, and then squash them into one big commit or rewrite them into functional units, or do whatever I want with them.

That's the same principle behind test-driven development (or test-driven design or even behavior-driven development, if you need to hang a new name on the same idea). Do one thing at a time. Make your code do a little more of what it needs to do. Prove that it all hangs together, that it all works, that it does what you intended.

Then clean up a little bit. That's refactoring, in your code and in your tests. That's rebasing in Git.

Sure, I wish I could know exactly what I needed to write from the start. I wish sometimes that programming were mere transcription of the voice of an ephemeral muse (though I find it difficult to imagine a muse dictating Perl or JavaScript or Haskell or J aloud). I wish I were the Beethoven of programming (without the mercurial temperament and the hearing loss).

Usually I don't get things right from the start. Fortunately, with a little discipline and the willingness to work in small steps, to erect and replace the scaffolding as I go, I usually get a lot closer to the right code than if I guessed.

Maybe that means I've thrown out more code than I've written. (It's satisfying to delete unused code, after all.) Maybe any project which starts as a proof of concept, then has to pivot in other directions to do what it's always needed to do, eventually becomes a Ship of Theseus.

I'm okay with that. It's more important to me to create something useful and then make it right than to wait on getting it right before other people can find value in it. I may never write the right code from the start, but I believe I can make almost-right code much, much more right, with discipline and care and feedback.

NYTProf, File IO, and an Optimization Gone Awry

By chromatic on May 7, 2012 2:56 PM

One of my projects performs a lot of web scraping. Once every n units of time (where n can be days or weeks), a batch process fetches several web pages and extracts information from them. It's a problem solved very well.

I designed this system around the idea of a pipeline of related processes, where each component is as independent and idempotent as possible. This has positives and negatives; it's an abstraction like any other.

I initially wrote "fetch remote web page" and "analyze data from that page" as a single step, because I thought "analyze" was the main goal and "fetch" was a dependent task. I separated them a couple of weeks ago to simplify the system: analysis now expects data to be there, while fetching can run in parallel on a single machine or across multiple machines. (Testing the analysis step is also much easier because feeding in dummy data is now trivial.)

I use the filesystem as a cache for these fetched files. That's easy to manage. I modified the role I use to grab data for the analysis stage to look in the cache first, then fall back to a network request. That was easy too. The get_formatted_data_for_analysis() method looked something like:

sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path         = $self->get_cached_path( $type, $key );
    if (-e $cached_path)
    {
        my $text = read_file( $cached_path );
        return $self->formatter->format_string( $text ) if $text;
    }

    return $self->formatter->format_string( $self->fetch_by_url( $type, $key ) );
}

I thought I was done. This trivial caching layer took five minutes to write and gave my project a lot of flexibility.

I thought this would speed up the processing stage, because I was able to make the fetching stage embarrassingly parallel so that more than one fetch could block on network IO simultaneously. My rough benchmark didn't show any speed improvement, but it was fast enough, so I moved on.

On Friday I decided to profile the slowest stage of the application with Devel::NYTProf. The slowest stage was the processing stage. I isolated it so that it performed no network fetching. It was still slow.

One of the formatter modules used to extract data from web pages is HTML::FormatText::Lynx. It allows me to run lynx --dump to strip out all of the HTML and other formatting of a document. The formatter allows you to pass in the name of a file or the contents of a file as a string.

For some reason, most of the time in the processing stage in the profile was spent in file IO. That wasn't too surprising; these aren't all small files and there may be thousands of them. I dug deeper.

Most of the time in the processing stage in the profile was spent in reading the files in my method and reading files in the formatter—reading files, even though I was passing the contents of those files to the formatter as strings.

I poked around at a few other things, but came back to the source code of the formatter. A comment in HTML::FormatExternal says:

format_string() takes the easy approach of putting the string in a temp file and letting format_file() do the real work. The formatter programs can generally read stdin and write stdout, so could do that with select() to simultaneously write and read back.

In other words, all of the work I was doing to read in files was busy work, duplicating what the formatter was about to do anyway. (Okay, I stared at the code for a couple of minutes, thinking about various approaches of rewriting it and submitting a patch or monkey patching it. Then I turned lazier and wiser.) I rewrote my code:

sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path         = $self->get_cached_path( $type, $key );
    return $self->formatter->format_file( $cached_path ) if -e $cached_path;

    return $self->formatter->format_string( $self->fetch_by_url( $type, $key ) );
}

The result was a 25% performance improvement.
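
To make the duplicated work concrete, here's a toy model that counts file reads and writes for both approaches. None of this is the real HTML::FormatExternal code: slurp(), spill(), and these format_file() and format_string() functions are hypothetical stand-ins, and the "formatting" is a crude tag strip instead of a call to lynx. It only assumes what the quoted comment says, that format_string() writes a temp file and lets format_file() do the work.

```perl
use strict;
use warnings;
use File::Temp ();

my %io = ( reads => 0, writes => 0 );

# Read a whole file into a string, counting the read.
sub slurp
{
    my $path = shift;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    local $/;
    my $text = <$fh>;
    $io{reads}++;
    return $text;
}

# Write a string to a temp file, counting the write.
# The file disappears when the returned object goes out of scope.
sub spill
{
    my $text = shift;
    my $tmp  = File::Temp->new( SUFFIX => '.html' );
    print {$tmp} $text;
    close $tmp or die "Cannot close temp file: $!";
    $io{writes}++;
    return $tmp;
}

# Stand-in for format_file(): read the file and strip tags.
sub format_file
{
    my $html = slurp( shift );
    $html =~ s/<[^>]+>//g;
    return $html;
}

# Stand-in for format_string(): temp file first, then format_file().
sub format_string
{
    my $tmp = spill( shift );
    return format_file( $tmp->filename );
}

my $cached = spill( '<p>Hello, <b>world</b>!</p>' );

# First version: read the cached file myself, then pass a string.
%io        = ( reads => 0, writes => 0 );
my $old    = format_string( slurp( $cached->filename ) );
my %old_io = %io;

# Second version: pass the path and let the formatter read it.
%io        = ( reads => 0, writes => 0 );
my $new    = format_file( $cached->filename );
my %new_io = %io;

printf "string: %d reads, %d writes\n", @old_io{qw( reads writes )};
printf "file:   %d reads, %d writes\n", @new_io{qw( reads writes )};
```

In this model the string version reads every document twice and writes it once more, while the file version reads it once. The real numbers depend on the formatter and the filesystem, so treat this as an illustration of the shape of the problem, not a benchmark.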

Three things jumped out at me in this process. First, how nice is it to have a working tool like NYTProf and a community that distributes source code, so that I could examine the whole stack of my application to isolate performance problems? Second, how interesting that an assumption and an admitted shortcut in a dependency could have such an effect on my own code. Third, how much more I like my new code with all of the file handling gone; pushing that responsibility elsewhere is a nice simplification even without the performance improvement.

Perhaps the two tools I miss most from my C programming days are Valgrind/Callgrind and KCachegrind, but NYTProf goes a long way toward filling that gap. Besides, I'm at least 20 times more productive with a language like Perl.

Smoothing the Condescending Onramp

By chromatic on May 2, 2012 2:42 PM

If you ever need a dose of humility, solve a non-trivial problem and then watch a Real Actual User try to figure out how to use it.

In my second professional job, when I was a system administrator at HP, I worked in the laser printer group. One afternoon, someone walked by my desk and asked me to do a user interaction study. I followed her to a little lab area, where she handed me a list of tasks, and asked me to complete them.

I did, except that I misread the icon on the copier and put in the source pages upside down, and made ten warm and blank pieces of paper. As soon as that happened, I understood the icon and why I'd misinterpreted it.

I never heard the results of the study, but I hope my stubborn confusion ended up improving the product.

User experience (and real user experience, not the fake user experience stuff that says users are clueless and incapable of all of the complexity of navigating the cereal aisle of an American grocery store and thus interfaces must degenerate to a single beveled button which says "DO IT", do you like my black turtleneck?) is fascinating. What's clear to you, you who understand the internal model of the software, is perfectly opaque to users. Users know the results they want, but not necessarily how to achieve them.

Making things easy for novices—for people who don't have a correct internal model of the software—can be compatible with making powerful software. Consider the Perl 5 standard Unicode preamble necessary to convince Perl to use the defaults you probably want to handle anything-but-Latin-1 correctly.
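
The exact lines people recommend vary, and the longer versions in circulation add more. As a rough sketch, a shortened preamble might look like the following. Which lines count as "the defaults you probably want" is an assumption on my part, not a canonical list.

```perl
use 5.014;                              # enables strict, say, and unicode_strings
use utf8;                               # the source file itself is UTF-8
use warnings;
use warnings qw( FATAL utf8 );          # malformed UTF-8 is an error, not a warning
use open qw( :std :encoding(UTF-8) );   # UTF-8 layers for STDIN, STDOUT, STDERR, and open()

# "café", with the accented letter written as an escape so that this
# file stays ASCII-only.
my $word = "caf\x{e9}";

say length $word;    # counts characters, not bytes
say uc $word;        # unicode_strings gives U+00E9 Unicode casing rules
```

That's five lines a novice has to know to type before the program treats text the way they expect, which is the point: the power is there, but the onramp is steep.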

(When user complaints of "My code doesn't work!" get met on PerlMonks and the Perl Beginners List and elsewhere with "What's the error message?", you know the languages, libraries, tools, and ecosystem could do more to help people debug their own code.)

You see the problem when books and other tutorial materials say "Error checking is left as an exercise for the reader", as if the burden of writing correct code or the increased page count is far more important than the desire to help new programmers learn how to code well.

I'm not only talking about better defaults (like strict enabled with use 5.014;). I'm not only talking about writing and collecting good Perl tutorials. (Part of the reason Modern Perl: the Book is available for free online is to continue to cultivate the culture of making great tutorial material available to anyone and everyone.)

With that said, I do despise the attitude of "You have to be clueful enough to use the proper incantation at the start of your programs before you'll get help on PerlMonks". Sure, those of us who know Perl now had to learn the hard way that symbolic references and global variables make our code harder to manage, that a unified testing system can only improve the CPAN, and that agreeing on an interoperable OO syntax (if not implementation) lets us concentrate on solving problems, not rebuilding Greenspun frameworks, but that's no reason to force the same learning curve on novices.

We'll never remove the essential complexity from programming (to do so would require us to remove the essential complexity from the problems we're trying to solve). We can smooth out the onramp for new programmers. That requires us to think like new programmers and to understand what they're trying to do and why.

Sometimes that requires those of us who see a question and think "Wow, everyone knows how to use a hash! What's wrong with you for not understanding this?" to shut up. (Sometimes the best person to help a new programmer is someone who was recently new.) Often that requires us to listen and look for the deeper question.

That probably means we should be a little gentler on the audiences we reach when we publish text and code. As Tom Dale wrote in Best Practices Exist for a Reason:

writing code before you have an expert-level understanding is okay.

(The whole post and its comments are... enlightening.)

Ultimately I expect the real point is to know who you're writing for. If you're only ever writing for your own amusement and you're willing to cut off everyone who doesn't share your level of knowledge, that's one thing. If you're writing to help other people, even if they have just started using Perl today, perhaps there are ways you can smooth the onramp for them a little bit more. After all, the things we think are easy now are easy because we understand the intricacies of lexical binding and scope and default topicalization and eager versus iterative file reading and so on.
