October 2012 Archives

I Stopped Parsing XML Thanks To XML::Rabbit

By chromatic on October 29, 2012 5:15 PM | 6 Comments

A while back, Robin Smidsrød sent me a link to a series of articles starting with Implementing WWW::LastFM. He'd just released a new CPAN module to simplify extracting information from XML documents and wanted to show it off.

I put off looking at it, and then I looked at it, and I was impressed.

Then I had to write ETL code for clinical reports from the US FDA (a regulatory agency which, among other things, requires specific controlled studies before approving drugs to be sold to the public within the US). The ~135,000 reports in my dataset range from a few dozen lines of XML to hundreds of kilobytes, and they all roughly follow the same format. (Some fields are apparently optional, and some fields have started to appear over the years.)

I looked at four example files and very nearly started to prepare myself to use an XML parser to extract data, and then I remembered Robin's articles and his module XML::Rabbit. (You may have opinions about non-technical names, but I remembered the name, and that's what matters.)

XML::Rabbit is something like an object-XML mapper which uses Moose and XPath. In other words, you declare the attributes of a class as mappings of names to XPath expressions.

Consider an XML document which corresponds to a record like NCT00210366:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="89160">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>Information obtained from ClinicalTrials.gov on October 25, 2012</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00210366</url>
  </required_header>
  <id_info>
    <org_study_id>IELSG21</org_study_id>
    <nct_id>NCT00210366</nct_id>
  </id_info>
  <brief_title>Salvage Therapy With Idarubicin in Relapsing CNS Lymphoma</brief_title>
  <official_title>Salvage Therapy With Idarubicin in Immunocompetent Patients With Relapsed or Refractory Primary Central Nervous System Lymphomas</official_title>
  <sponsors>
    <lead_sponsor>
      <agency>International Extranodal Lymphoma Study Group (IELSG)</agency>
      <agency_class>Other</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>International Extranodal Lymphoma Study Group (IELSG)</source>
  <oversight_info>
    <authority>Italy: Ministry of Health</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The main objective of the trial is to assess the therapeutic activity of idarubicin as
      salvage treatment in patients with recurrent or progressive lymphoma in the central nervous
      system.
    </textblock>
  </brief_summary>
  <overall_status>Terminated</overall_status>
  <why_stopped>
    due to slow accrual
  </why_stopped>
  <start_date>November 2004</start_date>
  <primary_completion_date type="Actual">July 2010</primary_completion_date>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <study_design>Allocation:  Non-Randomized, Endpoint Classification:  Safety/Efficacy Study, Intervention Model:  Single Group Assignment, Masking:  Open Label, Primary Purpose:  Treatment</study_design>
  <primary_outcome>
    <measure>objective response to treatment</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>duration of response</measure>
  </secondary_outcome>
  <secondary_outcome>
    <measure>overall survival</measure>
  </secondary_outcome>
  <secondary_outcome>
    <measure>acute side effects of idarubicin</measure>
  </secondary_outcome>
  <enrollment type="Anticipated">25</enrollment>
  <condition>Lymphoma, B-Cell</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Idarubicin</intervention_name>
  </intervention>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

          -  Histological or cytological diagnosis of non-Hodgkin's lymphoma

          -  Disease exclusively localised into the CNS at first diagnosis and failure

          -  Progressive or recurrent disease

          -  Previous treatment with HDMTX containing CHT and/or RT

          -  Presence of at least one target lesion, bidimensionally measurable

          -  Age 18 - 75 years

          -  ECOG performance status &lt; 3 (Appendix 1).

          -  No known HIV disease or immunodeficiency

          -  HBsAg-negative and Ab anti-HCV-negative patients.

          -  Adequate bone marrow function (plt &gt; 100000 mm3, Hb &gt; 9 g/dl, ANC &gt; 2.000 mm3)

          -  Adequate renal function (serum creatinine &lt; 2 times UNL)

          -  Adequate hepatic function (SGOT/SGPT &lt; 3 times UNL, bilirubin and alkaline
             phosphatase &lt; 2 times UNL)

          -  Adequate cardiac function (VEF ≥ 50%)

          -  Absence of any psycological, familial, sociological or geographical condition
             potentially hampering compliance with the study protocol and follow-up schedule

          -  Non-pregnant and non-lactating status for female patients. Adequate contraceptive
             measures during study participation for sexually active patients of childbearing
             potential.

          -  No previous or concurrent malignancies at other sites with the exception of
             surgically cured carcinoma in-site of the cervix and basal or squamous cell carcinoma
             of the skin and of other neoplasms without evidence of disease since at least 5
             years.

          -  No concurrent treatment with other experimental drugs.

          -  Informed consent signed by the patient before registration
      </textblock>
    </criteria>
    <gender>Both</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>75 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>
  <overall_official>
    <last_name>Andres JM Ferreri, MD</last_name>
    <role>Study Chair</role>
    <affiliation>San Raffaele Hospital - HSR Servizio di radiochemioterapia</affiliation>
  </overall_official>
  <location>
    <facility>
      <name>Servizio Radiochemioterapia - Ospedale San Raffaele</name>
      <address>
        <city>Milan</city>
        <zip>20132</zip>
        <country>Italy</country>
      </address>
    </facility>
  </location>
  <location_countries>
    <country>Italy</country>
  </location_countries>
  <link>
    <url>http://www.ielsg.org</url>
    <description>Click here for more information about this study</description>
  </link>
  <verification_date>July 2010</verification_date>
  <lastchanged_date>July 29, 2010</lastchanged_date>
  <firstreceived_date>September 13, 2005</firstreceived_date>
  <responsible_party>
    <name_title>International Extranodal Lymphoma Study Group</name_title>
    <organization>IELSG</organization>
  </responsible_party>
  <is_fda_regulated>No</is_fda_regulated>
  <has_expanded_access>No</has_expanded_access>
  <condition_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Lymphoma</mesh_term>
    <mesh_term>Lymphoma, B-Cell</mesh_term>
  </condition_browse>
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Idarubicin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been released for this study                              -->
</clinical_study>

At a minimum, I might need to extract the NCT number, the start date, the completion date, and the long title from this document. In XML::Rabbit, that's as simple as:

package Study;

use strict;
use warnings;

use XML::Rabbit::Root;

has_xpath_value long_title           => './official_title';
has_xpath_value nct                  => './id_info/nct_id';
has_xpath_value start_date           => './start_date';
has_xpath_value completion_date      => './completion_date';

finalize_class();

1;

That's it.

Seriously, that's it. Parse a document with my $study = Study->new( xml => $xml );. Access the attributes with accessors.
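
Putting those pieces together, here is a minimal end-to-end sketch. It assumes XML::Rabbit is installed from CPAN, and it uses a trimmed-down, hypothetical document in place of a full record; the class is a shortened copy of the one above.

```perl
use strict;
use warnings;

package Study {
    use XML::Rabbit::Root;    # CPAN module; not part of core Perl

    has_xpath_value long_title => './official_title';
    has_xpath_value nct        => './id_info/nct_id';
    has_xpath_value start_date => './start_date';

    finalize_class();
}

package main;

# Hypothetical, trimmed-down stand-in for a real ClinicalTrials.gov record.
my $xml = <<'END_XML';
<clinical_study>
  <id_info><nct_id>NCT00210366</nct_id></id_info>
  <official_title>Salvage Therapy With Idarubicin in Relapsing CNS Lymphoma</official_title>
  <start_date>November 2004</start_date>
</clinical_study>
END_XML

# Parse the string, then read each mapped value through its accessor.
my $study = Study->new( xml => $xml );

printf "%s (%s) started %s\n",
    $study->nct, $study->long_title, $study->start_date;
```

The calling code never touches a parser object or walks a tree; it only sees the accessors.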

It gets better. I also need to extract contacts from the study so I can associate them with user accounts. XML::Rabbit can create nested objects too:

package Study;

use strict;
use warnings;

use XML::Rabbit::Root;

has_xpath_value long_title      => './official_title';
has_xpath_value nct             => './id_info/nct_id';
has_xpath_value start_date      => './start_date';
has_xpath_value completion_date => './completion_date';

has_xpath_object_list contacts  => './overall_contact|./overall_contact_backup'
                                => 'Study::Contact';

finalize_class();

package Study::Contact;

use strict;
use warnings;

use XML::Rabbit;

has_xpath_value email     => './email';
has_xpath_value last_name => './last_name';
has_xpath_value role      => './role';

finalize_class();

1;

For every node matching the XPath expression provided to contacts, XML::Rabbit will attempt to create a Study::Contact item and associate it with the parent Study object.

I truly appreciate that these documents are plain old Moose objects. Anything you can do with Moose you can do with these classes—including adding methods (or applying roles or...) to add behavior to the objects.

Updating the parser classes to add a new element to extract takes thirty seconds, and that includes launching a new instance of my text editor. I added a couple of methods to report documents with missing fields and that took about ten minutes (it took an order of magnitude longer to examine the documents than it did to write my code).

I'm still not sold on XML as a data interchange format, but the combination of XPath and Moose has certainly made my job much, much easier. I almost wish I'd thought of this trick.

Why the Modern Perl Book Uses autodie

By chromatic on October 24, 2012 10:52 PM | 7 Comments

The primary reason I wrote Modern Perl: the book is to help people new to Perl write Perl well. Every decision we made editorially had to support that reason. (See Why Modern Perl Teaches OO with Moose and Why the Modern Perl Book Avoids Bareword Filehandles, for two other examples.)

The start of the book suggests that all example programs in the book will work with the environment set up by:

use Modern::Perl '2012';
use autodie;

You've probably already heard of Modern::Perl and you may have heard of autodie.

Not everyone is a fan of either. Recently, laufeyjarson wrote Why I Dislike autodie. His first point—it changes the language—is strong and his second—it would work better if it were a core offering of Perl, rather than an extension—is even more compelling to me. The rest is preference, and as such, TIMTOWTDI applies.

I don't always use it myself, for example.

Why is it in the Modern Perl book? The most important reason is a combination of laziness and correctness: if novices apply the advice to use the pragma, Perl will handle error checking for them. This means no aping of code they might not understand fully, no opportunity for them to forget to check the result of an operation which might fail, and consistent reporting of error messages.

In other words, it makes the failures they ought to be checking for apparent.
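
As a minimal sketch of that difference (the file name is hypothetical and assumed not to exist), compare a plain open, which fails quietly unless you check it, with the same call under autodie:

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist on this machine.
my $missing = '/nonexistent/modern-perl-example.txt';

# Without autodie: open() returns false and execution simply continues
# unless the programmer remembers to check the result.
my $unchecked = open my $fh1, '<', $missing;

# With autodie (lexically scoped to this block): the same failure
# throws an exception, which eval catches here.
my $error;
{
    use autodie;
    eval { open my $fh2, '<', $missing; 1 } or $error = $@;
}

print "plain open returned: ", ( $unchecked ? 'true' : 'false' ), "\n";
print "autodie reported: $error\n" if $error;
```

The exception is an object which stringifies to a message naming the file and the reason for the failure, so every unchecked failure reports itself in the same way.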

Should novices eventually learn which Perl builtins invoke system commands and as such might fail? Yes. Should they eventually learn the value of consistent mechanisms for finding and reporting errors? By all means.

Is that the first thing they should learn?

(If you believe that the proper way to teach someone how to program is to throw a copy of K&R at them, because that's how you did it, and because anyone who doesn't have the patience to learn how manual memory management and pointers work shouldn't be a real programmer, you have a fine opinion. It's a wrong opinion and you should feel bad for holding it, but it's a fine opinion.)

For experienced programmers like you and me, autodie is an offered convenience. You don't have to take advantage of it, but it's available for you to do so. You and I both know how to do what it does in multiple ways and can decide which ways are better than others in whatever circumstances we find ourselves.

For someone who's just learning Perl, autodie is a way for the computer to take care of some of the tedious but important details that matter most when things go wrong. (If I had a dollar for every time someone failed to check that an open succeeded, I could pay someone to turn autodie into a core feature.)

It also makes didactic code shorter. (If I had a tachyon for every time I read a piece of technical writing which apologized "Error checking has been left as an exercise for the reader", I could invent a time machine to go back in time and thwack the writers in the head for that sin.)

Will the modern Perl police come and take away your keyboard if you don't use autodie in your own code? Even if they existed, they wouldn't. Yet the same way we don't let new programmers wander off and write control code for nuclear medicine control boards or space elevator design systems unsupervised, sometimes we should let the computer take care of some of the details until they get the hang of things. It's not that they are incapable of writing good error handling code. It's that they don't know yet how valuable it is.

Slow Tests are a Code Smell

By chromatic on October 22, 2012 8:56 PM | 1 Comment

I've been adding a test suite to a young project over the past couple of weeks. (Seems like a good use of my skills.)

Going from zero tests to the basic "does it compile?" test is a huge improvement, the same way that going from zero tests for a feature to one test for a feature is a dramatic improvement. You do get more value for each test you add, but novice testers don't always realize the nuance that there's a point of diminishing returns with test coverage.

Like code, tests have a cost. Every additional line of code is code that has to be maintained. If the value of the feature that code enables is greater than the cost of maintenance, you've made a wise investment. Sometimes the opposite is true.

One of the most dramatic examples of cost I've seen in the past couple of weeks is the cost of time spent running the test suite. Every new test file added to the system adds about three seconds to the length of time required to run the entire suite. That's almost entirely setup time, dominated by disk IO. (The IO-heavy setup means that running the test suite in parallel actually takes longer than running the tests serially.) After two weeks of work, the test suite has gone from a few assertions which run (and fail) in a couple of seconds to over 1600 assertions which run in somewhere between 70 and 80 seconds.

The amount of coverage is decent, and the ability to write code and to change things without fear of breaking important features is very nice, but the longer the test suite takes to run, the less likely people are to run it before making commits. (This has happened.)

As my colleague Jim Shore wrote, a ten minute build is the gold standard for deployment, testing, and automation quality. (If I were to add anything, I would suggest that you should be able to check out the project on a fresh machine and generate a deployable version, completely tested, within ten minutes.)

Automation is step one. Test quality is another step—and getting the speed of the test suite back under 20 seconds is a goal for sure.

Disabling Perl with an Environment Variable

By chromatic on October 18, 2012 6:00 AM

Want to amuse your friends and vex your co-workers? Stick this environment variable somewhere in your system:

PERL5OPT=-Mbase='base;exit'

Want to lose at all benchmarks? Change exit to sleep100000 instead. (Just don't put a space anywhere in the payload.)

Test::Class versus Modern Perl OO

By chromatic on October 13, 2012 2:12 PM | 1 Comment

Three and a half years ago, my colleague Ovid wrote a series about organizing test suites with Test::Class. I've happily used Test::Class for years, for projects where reusability of test suites increased my confidence in the correctness of the code and reduced duplication in my work.

Test::Class works best when you can organize the structure of your test suite along the lines of your classes under test.

One drawback of Test::Class I've noticed lately is that it follows the so-called classical Perl style of object orientation, rather than the modern style. (The primary distinction I make between classical Perl OO and modern Perl OO is the degree of formalism in the structure of the OO system. In other words, an OO system that doesn't encourage or at least permit the use of roles is in no way a modern OO system.)

The problem is Test::Class's use of function attributes:

sub test_some_predicate :Tests {
    my $self = shift;
    ok $self->some_method, 'some_method() should return a true value';
    ...
}

While it's possible to use something like Role::Basic to provide roles of test assertions you can apply to your test classes, the amount of work necessary to convince Test::Class that these composed methods are part of the assertion framework is more work than I want to do.

The problem is that, in the absence of a well-defined Perl 5 metaobject protocol, Test::Class had to build its own mechanism and that mechanism turns out not to have been reusable enough with a modern Perl object system.

That's not the fault of the authors (who could have predicted this need to such a degree?), but it's another criticism of Perl 5's attributes mechanism which is tied too closely to Perl 5's rather ad hoc compilation model, which translated syntax into policy with too little regard for the user-defined semantics attached to extension syntax. To rephrase in another fashion, Perl 5's syntactic extension mechanisms work well enough within their compile-time lexical scope, but woe to you if you want to reuse them outside of that compile-time lexical scope. Perl's already moved well on by now.

I've used Test::Routine very productively in another project and may indeed migrate this test suite to it. In the meantime, the test suite (nascent though it may be after a week of work) does suffer a little bit of duplication and near duplication, ironically enough. A dose of multiple inheritance might solve the problem, but at some point a good designer has to ask whether the amount of abstraction required in a test suite is worth the result. Simplicity of debugging is a concern which outweighs many others.

Emulating Dynamic Scope with Lexical Destruction

By chromatic on October 10, 2012 9:37 PM | 2 Comments

This past week I've been writing a lot of tests for a project that has little test coverage. Fortunately it's a young project and it's not too messy, but adding tests after the fact is always more difficult than designing with tests from the start.

One of the pieces of this system uses introspection of package global variables to define class options. (It doesn't use Moose, at least not yet.) In other words, you might see something like:

package MyProj::Item;

our $lifetime    = 'request';
our $persistence = 'nosql';

The parent class of all of these children inspects those attributes and constructs instances based on their values. In some of the tests I've decided to manipulate those values to get as much coverage as possible before we rewrite pieces of the code. That means manipulating those global variables.

The problem with global variables is, as always, action at a distance. If I write to one in one place, what effect do I have on code which reads from it in other places? What effect do I suffer from code which also modifies it in other places?

One of the useful uses of Perl 5's local is to allow you to change the value of a global variable within a delimited lexical scope. One of my favorite uses of this technique is when debugging an array of items, to change the list delimiter global variable:

    local $" = '] [';
    diag "[@_]";
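
To see the restoration that local performs, here is a small self-contained sketch. The show() helper is hypothetical and uses print-style string building rather than Test::More's diag so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: join a list the way "@items" would, using
# whatever value $" holds at the moment of the call.
sub show { my @items = @_; return "[@items]" }

my @args = qw( a b c );

my $before = show(@args);    # default separator is a single space

my $during;
{
    local $" = '] [';        # in effect until this block exits...
    $during = show(@args);   # ...including inside functions called from here
}

my $after = show(@args);     # the old value is back automatically

print "$before\n$during\n$after\n";
```

This prints `[a b c]`, then `[a] [b] [c]`, then `[a b c]` again: the change is visible to called code while the block runs, and nothing has to remember to undo it.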

I use that technique sparingly in tests, because it's slightly more magical than I want my tests to get. In this case, because there are so many of these global variables ("many" meaning "more than three") and because other people will work on this test suite too, I wanted to encapsulate changes to these globals behind a nice interface. I try never to make messes in my code, but I'm even more careful when I know other people will have to work with it.

This presented an interesting problem: how do you localize changes to global variables and hide that localization behind a nice interface?

I ended up with an interface that looks like this:

sub test_some_behavior :Tests
{
    my $self  = shift;
    my $guard = $self->localize_options(
        lifetime    => 'singleton',
        persistence => 'session_store',
    );

    ...
}

... and all of the magic is in localize_options(), which works something like:

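sub localize_options
{
    my ($self, %args) = @_;

    # Assumption: class_under_test() is a hypothetical accessor which returns
    # the name of the package whose globals to change (such as 'MyProj::Item').
    my $package = $self->class_under_test;

    my @restore;

    while (my ($var, $new_value) = each %args)
    {
        my $ref = do
        {
            no strict 'refs';
            \${ $package . '::' . $var };
        };

        # remember where the value lives and what it was before the change
        push @restore, [ $ref, $$ref ];
        $$ref = $new_value;
    }

    return bless \@restore, 'RestoreOnDestroy';
}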

package RestoreOnDestroy;

sub DESTROY
{
    my $self = shift;

    for my $restore (@$self)
    {
        my ($ref, $old_value) = @$restore;
        $$ref = $old_value;
    }
}

This method loops through all of the variable names, grabs references to the global variables, and then saves the references and the original values. Then it overwrites those values with the new values. Finally it blesses the array of references and old values into a very specific class.

That class does only one thing: when it goes out of scope, its finalizer iterates through that array and restores all of the old values. (You can see a generalization of this pattern in Scope::Guard, for example.)

The consumer of this interface only has to remember two things: receive the guard object returned from the method and don't let it escape the smallest scope desired. All of the knowledge of manipulating Perl 5's symbol tables is in a single place (and strict gets disabled for only a single line of code). Better yet, if the interface to declaring these attributes changes from package globals to something else, only this method has to change at all, and only in a limited way.
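
A standalone sketch of the whole pattern shows the restoration happening. It is a plain function rather than a test-class method, and passing the package name as the first argument is an assumption made for this example.

```perl
use strict;
use warnings;

package MyProj::Item;

our $lifetime    = 'request';
our $persistence = 'nosql';

package RestoreOnDestroy;

# When the guard object is destroyed, put every saved value back.
sub DESTROY
{
    my $self = shift;
    ${ $_->[0] } = $_->[1] for @$self;
}

package main;

# Function variant of localize_options(); the explicit package name
# argument is an assumption of this sketch.
sub localize_options
{
    my ($package, %args) = @_;
    my @restore;

    for my $var (keys %args)
    {
        my $ref = do { no strict 'refs'; \${ $package . '::' . $var } };
        push @restore, [ $ref, $$ref ];
        $$ref = $args{$var};
    }

    return bless \@restore, 'RestoreOnDestroy';
}

my @seen;
{
    my $guard = localize_options( 'MyProj::Item', lifetime => 'singleton' );
    push @seen, $MyProj::Item::lifetime;    # overridden while $guard lives
}
push @seen, $MyProj::Item::lifetime;        # restored once $guard is gone

print "@seen\n";
```

This prints `singleton request`. One caveat follows from the design: if the caller discards the return value, the guard is destroyed immediately and the old values come back before the test body runs.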

All of this comes from the discipline of treating code in test suites with the same care as other code. The desire to create a clean interface is just as important—and encapsulation often more so—when writing code to demonstrate that what you expect really does happen the way you expect.

Also it's interesting to emulate one feature you can't use through another.

BEGIN-time Initialization versus Testing

By chromatic on October 8, 2012 5:32 PM | 1 Comment

Today I came across some undertested code, and Allison and I had a short discussion about the best way to approach it.

If you've read my You're Already Using Dependency Injection, you may approach the problem of testing from the mindset that looks for dependencies—declared and undeclared—and tries to manage them. That's essentially the design goal of automated testing, or at least test-driven development. We want to produce a single system with sufficient decoupling that we can prove small, unique, and isolated assertions about the behavior of our code.

Sometimes Perl makes that easy for us. Sometimes it doesn't.

Good programmers use Perl modules to encapsulate discrete units of behavior. Yet if we're not careful, we can limit our options. Consider this code:

package Project::DB::Connection;

use strict;
use warnings;

use DBI;
use Project::Config;

my $dbh = DBI->connect( Project::Config->get_config( 'database' ) );

...

The desire to make $dbh a singleton is understandable, as is the desire to encapsulate it as a file-scoped lexical. (Assume there's a get_dbh() function exported or available.) Those are very likely advantages.

You can't immediately see the disadvantages, however.

If something else in your system uses this module, Perl puts an implicit BEGIN around all of the code. That means before the use statement has finished executing, this code will already have run. $dbh will be connected.

As well, this code has a dependency on the configuration module. Presumably that gets loaded too, and it may itself run code.

If you need to override any part of the database connection (to use a separate database for testing, if you don't necessarily have access to the correct database on a testing machine, or for whatever reason), you have to hijack something and you have to make that happen before this code runs.

Deciding what to hijack (do you mock the DBI? override part of the configuration? override get_dbh() and hope the initial connection works well enough in your testing environment?) is sometimes easier than figuring out what loads what else and when so that you can get the implicit order of loading correct. Perhaps even worse, you have to write really clever code to get Perl to do the right things in the right ways in the proper order. (Perl lets you do this, but just because you can do something doesn't make it a good idea.)

Compare that to the dependency management of lazy object attributes as exemplified by Moose.
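
The same idea can be sketched in plain Perl, without Moose: keep the singleton, but defer the connection until first use. The `$connector` hook is hypothetical, and a hash reference stands in for the DBI handle so that the example runs without a database.

```perl
use strict;
use warnings;

package Project::DB::Connection;

# Stand-in for DBI->connect(...) so the sketch runs without a database.
# Tests (or other callers) may replace it before the first get_dbh() call.
our $connector = sub { return { connected_to => 'production' } };

my $dbh;    # still a file-scoped singleton, but empty at load time

sub get_dbh
{
    $dbh //= $connector->();    # connect on first use only
    return $dbh;
}

package main;

# Nothing has connected yet, so a test can swap in its own connector.
$Project::DB::Connection::connector = sub { return { connected_to => 'test' } };

my $handle = Project::DB::Connection::get_dbh();
print "connected to: $handle->{connected_to}\n";
```

Because loading the package no longer triggers a connection, the order in which modules load stops mattering; the only requirement is that any override happens before the first call to get_dbh().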

Again, this may not be important to you until you want your test suite to run in an environment not exactly like the one you're using at the moment. (Shortcuts have a way of coming back to haunt you.) Even if you're perfectly happy in a simple environment you can't reproduce with the push of a button, the day may come when you want to know if your code works anywhere but your development machine.

That's when the degree to which you managed your implicit dependencies will either help or hinder you.

(That's why I make a habit of not running any code in implicit BEGIN blocks; it's cost me rework too many times.)

Solitude is a Drawback of Perl Productivity

By chromatic on October 5, 2012 9:18 AM

One of the drawbacks to being the only non-contract technical person working for Big Blue Marble is that a lot of the technical work that happens happens only because I do it.

(See Loaded for Werewolf, Why I Use Perl: Reliability, and Why I Use Perl: Testing for three discussions of Perl and productivity.)

While Perl's efficacy and whipupitude mean that an experienced programmer like me can get a lot done (maintaining tens of thousands of lines of code with ease and growing projects while keeping them under tens of thousands of lines of code) and Perl's shallow (but lengthy) learning curve means that a novice programmer can get things done without having to learn much, and while all of this suggests that you don't need a small army of coders to accomplish many tasks, some of us do work in isolation. (See Perl without IRC.)

This is a truism of modern programming in general, not entirely specific to Perl. If you know the bare minimum of how to automate repeated tasks with a computer, you're a superhero.

You don't have to know about computability or the universal Turing machine or the lambda calculus or algorithmic notation or linked lists or—heaven help us all—pointers to turn a repetitive task that'd take a person a boring afternoon to do every month into something that an excited person can review in a couple of minutes and get to work doing something better.

... but that power can also be isolating.

I don't need to know how to write my own webserver (even though I've written a couple from socket programming on up) to deploy a robust application stack with Apache or nginx and Plack. Knowing how webservers work helps, but I can find a couple of good tutorials and download a few dependencies and go on to interesting things.

Maybe one of the reasons the global Perl community is so spread out and disconnected from the core Perl community is that we're just too productive. We're system administrators and automators and toolsmiths each working in these strictly focused niches and our tools are powerful enough that we don't always feel the need to work with other people to get things done.

Sometimes that means we make messes and spend far too long chasing the wild geese before we rein ourselves back in to something more useful.

Yet I suspect that we walk a balance between being just productive enough that we can avoid collaboration and having blinders on to the benefits of showing our work to other people.

(Polemic statement I can't quite support: the global PHP community is slightly more cohesive because it's almost a requirement that you browse php.net continually to figure out what you're supposed to be doing, or at least what will get you close enough.)

Code Injection with eval require

By chromatic on October 3, 2012 10:54 AM

What's wrong with this code?

package Some::Loader;

use strict;
use warnings;

sub import
{
    my $class = shift;

    for my $module (@_)
    {
        eval "require $module;"
    }
}

1;

See it yet? Here's a hint: what can $module contain?

Still don't see it? Run this command line:

$ perl -MSome::Loader='vars; exit' -E 'say "Hello, world!"'

For more fun, export the environment variable PERL5OPT.

(Yes, everyone knows that if you don't have control over the environment variables set on your machine you can do a lot of damage, but do you check all of the software you install for changes to every place where your environment can possibly come from? It's not just PERL5OPT that's the problem either.)

If you're passing arbitrary external data to eval STRING, you might execute code. At least consider using something like Module::Runtime's require_module() to load modules dynamically, as that distribution checks the sanity of data passed to verify that it's receiving something that looks like a module name and not an arbitrary Perl expression.
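
If adding a dependency isn't an option, the general idea can be sketched with core Perl alone. The load_module() helper below is hypothetical and deliberately simplified; it is not how Module::Runtime is implemented, only an illustration of validating the name and avoiding eval STRING.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only strings shaped like Foo::Bar::Baz.
sub load_module
{
    my $module = shift;

    die "Refusing to load suspicious module name '$module'\n"
        unless $module =~ /\A[A-Za-z_][A-Za-z0-9_]*(?:::[A-Za-z0-9_]+)*\z/;

    # Translate Foo::Bar to Foo/Bar.pm and use the expression form of
    # require, which never compiles the name as Perl source.
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    require $file;

    return $module;
}

my $loaded = load_module('File::Spec');

# The payload from the command line above is rejected before it can run.
my $rejected = eval { load_module('vars; exit'); 1 } ? 0 : 1;

print "loaded $loaded\n";
print "rejected injection attempt\n" if $rejected;
```

The important change is that the untrusted string is treated as data (a file name to look up in @INC) rather than as code to compile.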

After all, you wouldn't pass arbitrary user data to a SQL statement without using placeholders, would you? (Oh wait, the Perl documentation recommends this idiom on the assumption that if you use it, any security problems are your own fault, you poor deluded fool.)

The Overhead of a Class

By chromatic on October 1, 2012 11:23 AM | 2 Comments

The only problem you can't solve by adding another layer of abstraction is the problem of too many layers of abstraction.

When I write code, I try to write simple code. I don't mean baby code (though it's okay if you're just learning to program). I mean code that does what it needs to do, is easy to understand if you know what it needs to do, and is easy to maintain if you understand the problem.

Everything should be in the right place. Everything should have a meaningful name. The organization should make sense and should suggest how to make meaningful and necessary changes.

(I usually have to let the architecture of a system emerge through guided trial and error and a couple of rounds of refactoring before I'm satisfied.)

I write a lot of Perl. Perl's very effective at allowing my projects to evolve in almost every way.

Almost.

Like most programmers I know, I struggle with the idea of primitive obsession. This might be more prevalent in languages with dynamic typing than with good static type systems. (Jim's example uses Java, one of the worst of all possible worlds, as you'll see. C is worse in this example.)

In simple terms, primitive obsession is what happens when we say "Oh, someone's name is just a string of two words" instead of representing a name with something that understands all of the information a name can contain (do you have a family name? a formal name? a title? a middle name? a multi-word first or last name? no last name? a cultural name distinct from a legal last name? a cultural or political organization of names in non-Western order?) and all of the operations you can perform on a name (casing, changing, sorting, searching, normalizing).

Primitive obsession is what happens when we say "I need to store a date, and as an American it's obvious that dates are always of the form MM/DD/YYYY." or "Sure, they'll have computers in 2000, but those extra two digits are super expensive right now."
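
A small sketch of the date case, using the core Time::Piece module and three made-up dates, shows how quickly the "it's just a string" assumption goes wrong:

```perl
use strict;
use warnings;
use Time::Piece;

# Made-up sample data in MM/DD/YYYY form.
my @raw = ( '12/31/1999', '01/01/2000', '07/04/1999' );

# Primitive approach: the dates are "just strings", so sorting compares text.
my @as_strings = sort @raw;

# Object approach: parse each string once, then compare points in time.
my @as_dates = map  { $_->ymd }
               sort { $a <=> $b }
               map  { Time::Piece->strptime( $_, '%m/%d/%Y' ) } @raw;

print "string sort: @as_strings\n";
print "date sort:   @as_dates\n";
```

The string sort puts 01/01/2000 first, ahead of both 1999 dates, while the object sort yields 1999-07-04, 1999-12-31, 2000-01-01. The primitive version is not wrong until someone asks it a question only a real date can answer.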

Perl exacerbates this with syntax. (Bet you're surprised to see me write that.)

For better or worse, as with many modern languages, the best way to create abstractions over data and behavior is to create a class, and this is where Perl occasionally gets in my way:

package Some::Class
{
    use Moose;

    has [qw( some_attribute another_attribute )], is => 'ro', lazy_build => 1;

    sub _build_some_attribute    { ... }
    sub _build_another_attribute { ... }

    sub some_method              { ... }

    __PACKAGE__->meta->make_immutable;
};

Given all of that code necessary to create a new class (and thanks to Moose it's much less than it could be and much better than I would normally write by hand), I far too often say "It's simpler to use the primitive here because I can refactor it to a class later". Remember also that adding a new class means adding a new file and loading it, or dealing with the order of compilation (did I mention the advantages of declarative syntax yet?) when adding a new class inline. Yes, some of those reasons are semantic and not syntactic, but don't overlook the syntax.

That decision doesn't always cause problems in the future, but it causes enough problems that it's a risk. (DBIx::Class deserves tremendous credit for including DBIx::Class::InflateColumn::DateTime as a core module. Only heroes get date and time calculations right, and Dave Rolsky is a hero for that and countless other thankless reasons.)

I conclude from this a few lessons:

The overhead of declaring a class is still higher than I would like
I am tremendously lazy and bad about predicting the future
Abstraction is costly in terms of design, but it often is a good investment

(I idly wonder how someone might design a functional approach to the same problem, and then I get lost in the question of how to declare closures for anonymous functions with an attractive syntax.)
