When I wrote the code that became Plack::Middleware::BotDetector, I planned to use it only for filtering non-humans out of our cohort analysis system. (You can read the entire rationale and explanation in Detecting Bots and Spiders with Plack Middleware.)
Since I wrote that article, I extracted that middleware from our project and released it on its own as Plack::Middleware::BotDetector. As is often the case, solving one problem suggests the possibility of solving multiple problems.
When I build systems that analyze data, I try to make it possible for the analysis to improve over time. Anomalous cases should be obvious and easy to correct and, when corrected, should no longer be obvious (because they're no longer anomalous). When detecting non-human user agents, we analyze our access logs for likely candidates to add to the list used to construct the regex passed to Plack::Middleware::BotDetector.
I had written a small Perl program to analyze our logs and give a histogram of user agents, but I still ended up eyeballing that list to see if any new bot user agents had appeared. (You can tell your SEO strategy is working when you get more bot traffic.)
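That analysis program needs only a few lines. Here's a minimal sketch (not our actual program; it assumes Apache's combined log format, which Plack::Middleware::AccessLog emits by default, where the user agent is the final quoted field):

```perl
use strict;
use warnings;

# tally the User-Agent field from access log lines; assumes the
# combined log format, where the user agent is the last quoted field
sub count_user_agents
{
    my %count;

    for my $line (@_)
    {
        my ($agent) = $line =~ /"([^"]*)"\s*$/;
        $count{$agent}++ if defined $agent;
    }

    return \%count;
}

# print a histogram, most frequent agents first
sub print_histogram
{
    my $count = shift;

    printf "%6d %s\n", $count->{$_}, $_
        for sort { $count->{$b} <=> $count->{$a} } keys %$count;
}
```

Piping the access log through something like this beats scrolling through raw logs, but it still leaves the final review to a human.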
Anytime you find yourself reviewing data by hand, ask yourself if a computer can do it.
We use Plack, obviously. Plack::Runner enables default middleware, including Plack::Middleware::AccessLog. That's responsible for writing an access log (you can configure the location), and we used that because it was easy and available.
"Wait," I asked myself. "Why am I reviewing this log information when I have to remember to exclude most of it because I already know it's bot traffic?" More important, our system already knows it's bot traffic, because the BotDetector middleware is already excluding those requests from our cohort analysis event logging.
What if I used the BotDetector to decide whether to log a request's information? (We don't do anything with these access logs which requires us to keep data about bot traffic.) That way, every update to the BotDetector regex would exclude more and more bot traffic, and the only things we'd see in our daily reports would be real users and bots we needed to exclude.
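The key is that BotDetector records its verdict in the PSGI environment, so anything downstream can check it. In essence it works like this simplified sketch (not the module's actual code; `make_bot_flagger` is a name invented here, and the real module uses the Plack middleware API rather than a bare closure):

```perl
use strict;
use warnings;

# wrap a PSGI app so every request carries a bot-or-not flag in $env
sub make_bot_flagger
{
    my ($app, $bot_regex) = @_;

    return sub
    {
        my $env = shift;
        my $ua  = $env->{HTTP_USER_AGENT} || '';

        # downstream middleware and apps can all branch on this key
        $env->{'BotDetector.looks-like-bot'} = 1 if $ua =~ $bot_regex;
        return $app->( $env );
    };
}
```

Every downstream component, whether cohort event logging or access logging, can then branch on `$env->{'BotDetector.looks-like-bot'}` instead of re-running the regex itself.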
I wrote a custom piece of middleware in about two minutes:
package MyApp::Plack::Middleware::AccessLogNoBots;
# ABSTRACT: Plack middleware which only logs non-bot requests

use Modern::Perl;
use parent 'Plack::Middleware::AccessLog';

sub call
{
    my $self = shift;
    my $env  = $_[0];

    return $env->{'BotDetector.looks-like-bot'}
         ? $self->app->( $env )
         : $self->SUPER::call( @_ );
}

1;
This class extends the AccessLog middleware class to override the call() method. If the request looks like it came from a spider, it passes the request through to the next middleware. Otherwise, it lets the parent class log the request.
Installing this in our .psgi file was more difficult than writing the class, which says more about how easy it was to write this class than anything else. The only complicating factor is that Plack::Runner takes the responsibility for setting up its AccessLog component. I ended up with something like:
use MyApp;
use MyApp::BotDetector;
use Plack::Builder;
use Plack::App::File;

my $app = builder
{
    enable 'Plack::Middleware::BotDetector',
        bot_regex => MyApp::BotDetector::bot_regex();
    enable 'Plack::Middleware::ConditionalGET';
    enable 'Plack::Middleware::ETag', file_etag => [qw/inode mtime size/];
    enable 'Plack::Middleware::ContentLength';

    if ($ENV{MA_ACCESS_LOG})
    {
        open my $logfh, '>>', $ENV{MA_ACCESS_LOG}
            or die "Cannot append to '$ENV{MA_ACCESS_LOG}': $!";
        $logfh->autoflush( 1 );

        enable '+MyApp::Plack::Middleware::AccessLogNoBots',
            logger => sub { $logfh->print( @_ ) };
    }

    MyApp->apply_default_middlewares(MyApp->psgi_app);
};
... where the presence of the environment variable governs the location of the access log file. I also changed the scripts we use to launch this .psgi file to pass the --no-default-middleware flag to Plack::Runner.
The results have been wonderful (except that our site looked a lot busier before, when the logs showed Baidu spidering the whole thing at least twice a day). The decorator pattern of Plack continues to demonstrate its value, and the cleanliness of extension and ease of writing this code argues yet again for putting conditionals (log or don't log) where they belong.
All I could ask for is a little more customizability for Plack::Runner to make some of the code in my .psgi file go away, but I'm probably at the point where it makes sense to avoid plackup and write my own program which calls Plack::Runner directly.
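Such a launcher needn't be long. A sketch following Plack::Runner's documented interface (the guard on run() is my addition, so merely loading the file starts nothing):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Plack::Runner;

# like plackup, but with our flags baked in so no one forgets them
my $runner = Plack::Runner->new;
$runner->parse_options( '--no-default-middleware', @ARGV );

# run() loads the .psgi file named on the command line and starts
# the server; only do so when a path (or other arguments) arrived
$runner->run if @ARGV;
```

This keeps the --no-default-middleware flag out of every launch script and in one place.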
Update: Miyagawa pointed out that Plack::Middleware::Conditional offers an alternate way to accomplish the same thing without writing custom middleware:
my $app = builder
{
    enable 'Plack::Middleware::BotDetector',
        bot_regex => MyApp::BotDetector::bot_regex();
    enable 'Plack::Middleware::ConditionalGET';
    enable 'Plack::Middleware::ETag', file_etag => [qw/inode mtime size/];
    enable 'Plack::Middleware::ContentLength';
    enable_if { ! $_[0]->{'BotDetector.looks-like-bot'} } 'AccessLog';

    MyApp->apply_default_middlewares(MyApp->psgi_app);
};
We didn't use this technique because of the way we wanted to handle the log file, but that's what the Conditional middleware is for.