Improve Your Extracted Traversal

The code from Extract Your Traversal solved my immediate problem elegantly. It had the feel of discovering the secret behind a puzzle which previously seemed difficult—like when you first understand that calculus is about calculating infinitely small things by subdividing your measurements into infinitely small pieces which you can reason about as a group.

I like those realizations, especially in code.

Breaking Reference Circularity

Contrary to what I wrote, the code does have a little more clunkiness than I liked due to practical considerations. The code as I presented it there had a memory leak. (I know; how unfair of me to leave it in there for you to find on your own, but that would have made the article even longer.) In particular, because the traversal function closes over the action dispatch hash and the action dispatch hash entries close over the traversal function, Perl 5's simple reference counting garbage collector will never collect either data structure.

Fixing this is easy in three ways:

Use weak references (via Scalar::Util)
Break the cycle manually
Embrace continuation passing

Perhaps the most well-known approach is to use weak references, but I prefer the second option. Here's the resulting code:

sub html2text {
    my ( $html ) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html)->elementify();
    my $top  = $tree->find_by_tag_name( 'body' ) || $tree;

    # declare here to close over in hash
    my $traverse;
    my %actions = (
        p => sub {
            my $text = $traverse->( shift );
            return '' unless $text;
            return $text . "\n\n";
        },
        a => sub {
            my $node = shift;
            my $text = $traverse->( $node );
            my $link = $node->attr( 'href' );

            return $text . ":\n\n" . $link . "\n\n";
        },
    );

    $traverse = sub {
        my $node = shift;
        my $text = '';
        for my $child ($node->content_list()) {
            if (ref $child) {
                my $tag    = $child->tag();
                my $action = $actions{ $tag } || $traverse;
                $text .= $action->( $child );
           } else {
                $text .= $child;
           }
        }

        return $text;
    };

    my $text = $traverse->( $top );

    $text =~ s/^\s+//g;
    $text =~ s/\s+$//g;

    # break the cycle
    undef $traverse;
    return $text;
}

By undefining the traversal function, Perl can collect that and reduce the reference count to the hash.

In truth, the weak reference code is very difficult to get correct on its own because of the anonymity of the traversal function: it closes over itself.

Recursing Anonymously

One solution is to make that function a named function. That's a fine solution, but it felt wrong to me, as I was trying to avoid making a class to follow the visitor pattern I've used in other contexts. (If you prefer making named functions, great! That's more than acceptable.)

Perl 5.16's current_sub feature offers an alternative which works with both named and anonymous functions. Instead of forcing the anonymous function to close over itself, it can refer to itself with __SUB__:


    $traverse = sub {
        my $node = shift;
        my $text = '';
        for my $child ($node->content_list()) {
            if (ref $child) {
                my $tag    = $child->tag();
                my $action = $actions{ $tag } || __SUB__;
                $text .= $action->( $child );
           } else {
                $text .= $child;
           }
        }

        return $text;
    };

That's one potential memory leak gone, but the reference to $traverse in the hash and vice versa still requires either explicit undef $traverse or weak ref handling. Is there another way?

Controlling Execution

When I originally thought of this approach, I approached it like a compiler writer might (you can hang up your hat, but it still probably fits). I wanted a different kind of control flow than Perl usually gives.

I planned to pass $traverse as a parameter to all of the action functions in the %actions hash. I stopped before I did that because I realized the code as written works just fine as it is, but what if I'd gone that far?


    my %actions = (
        p => sub {
            my ($node, $traverse) = @_;
            my $text = $traverse->( $node );
            return '' unless $text;
            return $text . "\n\n";
        },
        a => sub {
            my ($node, $traverse) = @_;
            my $text = $traverse->( $node );
            my $link = $node->attr( 'href' );

            return $text . ":\n\n" . $link . "\n\n";
        },
    );

    $traverse = sub {
        my $node = shift;
        my $text = '';
        for my $child ($node->content_list()) {
            if (ref $child) {
                my $tag    = $child->tag();
                my $action = $actions{ $tag } || __SUB__;
                $text .= $action->( $child, __SUB__ );
           } else {
                $text .= $child;
           }
        }

        return $text;
    };
}

Now all of the variables in each of the actions are truly lexical. They don't close over any scopes but their own. Their (recursive) control flow truly depends on the function references they receive. There are no memory leaks in this code (using the current_sub Perl 5.16 feature) and there's no need for explicit code to work around circular references.

The code is a little bit longer and, in my mind, a little bit uglier than the previous version. It is, however, a little bit more flexible. It allows different traversal mechanisms in different handlers. (A similar technique could allow different formatting mechanisms in different handlers, perhaps to change indentation levels or font display or whatever you might prefer.) Someone either more clever or motivated might even figure out a way to perform tail call elimination in certain cases.

Aside from the fix to the memory leak, I left the code as it is. (While we're very likely to migrate to Perl 5.16 by the end of January for Unicode improvements, we haven't done so yet, so the obvious __SUB__ improvement isn't yet an option.) I've spent a while thinking about it, and my conclusion is this:

It's fun to ponder alternate mechanisms of writing the same code and to analyze them for strengths and weaknesses, but when I reached a solution that I could explain to the other developers on the project and which met our needs, I stopped. I don't need to generalize a new mechanism of control flow we use elsewhere in the project, at least for this. (I did the other day for a separate task, but that's a story for another time.) I only need to make the code strong enough that it does the job without known flaws while keeping it clear enough that it's sufficiently obvious how it does what it does.

This code is open to modification to add new handlers. So far we've only needed two, but they're easy enough to write. They don't require any additional memory management code that they might, if I had used weak references. (The more adornment your code requires to satisfy the runtime, the easier it is to get things wrong when you have to modify it or copy, paste, and modify.) The traversal function is also easy enough to understand and to modify if necessary (though that seems very unlikely).

As much as the little compiler writer in my head who's studied things like Haskell and Scheme and closure optimization in JavaScript would like to go off on wild tangents to explore just how far it can push some of these ideas, the Perl programmer in my head steals techniques wherever it can find them, catalogs them as tools and patterns, and holds me accountable for writing sufficient and necessary code.

I'm glad I have those tools, but I'm even more glad for a language and ecosystem and community with such a practical and pragmatic mindset.

Breaking Reference Circularity

Recursing Anonymously

Controlling Execution

Tags:

Modern Perl: The Book

Categories

Monthly Archives

Pages

About this Entry