A lot of the advice given in Modern Perl: the Book is advice learned the hard way, whether through making my own mistakes, debugging code written on my teams, or reviewing code written by other people to help novices become better programmers. After a decade-plus of this experience, I think I've developed a good sense of what people find confusing and what problems rarely occur in practice.
The pervasive use of global variables? It'll eventually catch up to you. Variables popping into existence upon use, not declaration? It'll cause problems far sooner than you ever expect.
Clobbering $_
at a distance inside a map
block? It happened to me the other day. Yes, it surprised me too.
I've been attaching the Perl search bindings Lucy to a document processing engine. As part of the processing stage, my code adds documents to the search index. The index schema keeps track of specific fields, and it's denormalized slightly to decouple the document database from the search index. The code to add a document to the index creates a hash from method calls on each document object. That's where things started to go wrong:
sub add_entries
{
my ($self, $entries) = @_;
my $index = $self->index;
for my $entry (@$entries)
{
my $fields =
{
map { $_ => scalar $entry->$_() } keys %index_fields
};
$index->add_doc( $fields );
}
$index->commit;
$self->clear_index;
}
I noticed things went wrong when my test bailed out with strange errors.
Lucy was complaining about getting a hash key of ''
, the empty
string. I was certain that %index_fields
was correct.
While most of the methods called are simply accessors for simple object properties, these document objects have a method called short_content()
:
sub short_content
{
my $self = shift;
my $meth = length $self->content > $MIN_LENGTH ? 'content' : 'summary';
my $content = $self->$meth();
return unless defined $content;
my $splitter = Lingua::Sentence->new( 'en' );
my $total_length = 0;
my @sentences;
for my $sentence ($splitter->split_array( $content ))
{
push @sentences, $sentence;
$total_length += length $sentence;
# must be a big mess, if this is true
return $self->summary
if @sentences == 1
and $total_length > $MAX_SANE_LENGTH;
last if $total_length > 0.65 * $MAX_LENGTH;
}
if (@sentences)
{
my $text = join ' ', @sentences;
return $text if length $text > $MAX_SENTENCE_LENGTH && $text =~ /\S/;
}
return substr $content, 0, $MAX_SHORT_CONTENT_LENGTH;
}
A document may have a summary. It has content. short_content()
returns a slice of the first significant portion of either, depending on which
exists. While it's not the most detailed portion of the document, it's the
earliest significant portion of the document, and it's demonstrably the best
portion to index as a summary. (Thank you, inverted
pyramid.)
The rest of this method attempts to break the short content at a sentence boundary, so as not to cut it off in the middle of a word or thought.
Nothing in that method obviously clobbers $_
, but something
called from it apparently does. (I wonder if Lingua::Sentence or
one of its dependencies reads from a file or performs a substitution.)
Regardless, I protected my precious hash key with a little defensive
programming:
my $fields =
{
map { my $field = $_; $field => scalar $entry->$field() }
keys %index_fields
};
... and all was well.
While this has been a very rare occurrence in 13+ years of Perl 5 programming, the trap is particularly insidious. The more work Perl does within that map
block, the greater the potential for action at a distance. Furthermore, the better my abstractions—the more behavior hidden behind that simple method call—the greater my exposure to this type of bug.
Throw in things like dependency injection and Moose lazy attributes where you don't have to manage the dependency graph of your data initialization and flow yourself (generally a good thing) and you can't tell what's going to happen where or when.
If my instincts are correct, and something reads from a file somewhere such
that only the first call to construct a Lingua::Sentence
object clobbers $_
, the point is doubly true.
(Sure, changing map
to autolexicalize $_
would fix
this problem, but it has backwards compatibility concerns for people who
rely on this action at a distance—remember, it's only a problem
if you write to $_
—and it's too late to get such a
change into Perl 5.16, even if it were only enabled with use
5.016;
. Meanwhile, a variable assignment fixes it for me right now, and
that will have to suffice.)
Here's hoping 5.018
Or a bugfix to the module that's clobbering $_.
I could tell from reading your post that you know exactly where $_ is being globbered :) You're very diplomatic; I like it!
That's pretty weird. Given that => is just a synonym for the , operator, and the comma operator "[i]n list context, [is] just the list argument separator, and inserts both its arguments into the list. These arguments are also evaluated from left to right", I'd expect the $_ => ... to be evaluated before the $entry->$_()
But it's not, and very strangely so:
perl -MModern::Perl -e 'sub f { say "_ is $_"; $_++ }; $_ = 0; say join( ", ", $_, f(), $_, f(), $_, f(), $_, f() );'
_ is 0
_ is 1
_ is 2
_ is 3
4, 0, 4, 1, 4, 2, 4, 3
It works the same whether you use $_ or $x: the variable is evaluated at the end of the expression.
Using tie to see what runs when is illustrative:
perl -MModern::Perl -e 'package Tied; my $x = 0; sub TIESCALAR { bless {}, "Tied" } sub FETCH { say "FETCH $x"; $x } sub STORE { my ($this, $value) = @_; $x = $value; } package main; my $y; tie $y, "Tied"; sub f { say "y is $y"; $y = $y + 1; }; sub n { print "\n"; () } $y = 0; say join( ", ", $y, f(), n(), $y, f(), n(), $y, f(), n(), $y, f(), n() );'
FETCH 0
y is 0
FETCH 0
FETCH 1
FETCH 1
y is 1
FETCH 1
FETCH 2
FETCH 2
y is 2
FETCH 2
FETCH 3
FETCH 3
y is 3
FETCH 3
FETCH 4
FETCH 4
FETCH 4
FETCH 4
FETCH 4
4, 1, 4, 2, 4, 3, 4, 4
As we can see, the value of $y is fetched four times, right before running join, not interleaved with the calls to f().
... Or was that something that everyone but me already knew? :)
In your example I think you'll find that the fetches are happening from within the join function rather than right before it. The function will get the variable not the value of the variable (but it will get the result of those f(), n() functions first).
In a simple map such as... map { $_, f($_) } ... it is less obvious but is the same situation: the actual map block is returning a list containing $_ (rather than the value of $_), and whatever happens after that is fetching the value.
thirty5penguins: Indeed, some experimentation shows you've got it exactly right. Thanks for the explanation, it all makes much more sense now.
(actually in the map example I suppose it is the implicit return that fetches the value)
Want something weirder? Interpolate $_ into its own string as the hash key and the clobbering still happens.
So, has a ticket been filed, or the author notified yet?
...according to the docs, an alternative tracker is preferred to RT, so I suppose on the next update, the meta data could be added to tell search.cpan.org where the preferred tracker is.