If you analyze the requests made to your web site, you'll have to deal with enormous numbers of bots, spiders, and other automated requests for your resources which don't represent measurable users. As promised in Annotating User Events for Cohort Analysis, here's how I handle them.
I wrote a tiny piece of Plack middleware which I enabled in the .psgi file which bundles my application:
package MyApp::Plack::Middleware::BotDetector;
# ABSTRACT: Plack middleware to identify bots and spiders

use Modern::Perl;
use Plack::Request;
use Regexp::Assemble;

use parent 'Plack::Middleware';

my $bot_regex = make_bot_regex();

sub call
{
    my ($self, $env) = @_;
    my $req          = Plack::Request->new( $env );
    my $user_agent   = $req->user_agent;

    if ($user_agent)
    {
        $env->{'BotDetector.looks-like-bot'} = 1 if $user_agent =~ qr/$bot_regex/;
    }

    return $self->app->( $env );
}

sub make_bot_regex
{
    my $ra = Regexp::Assemble->new;

    while (<DATA>)
    {
        chomp;
        $ra->add( '\b' . quotemeta( $_ ) . '\b' );
    }

    return $ra->re;
}

1;
__DATA__
Baiduspider
Googlebot
YandexBot
AdsBot-Google
AdsBot-Google-Mobile
bingbot
facebookexternalhit
libwww-perl
aiHitBot
Baiduspider+
aiHitBot-BP
NetcraftSurveyAgent
Google-Site-Verification
W3C_Validator
ia_archiver
Nessus
UnwindFetchor
Butterfly
Netcraft Web Server Survey
Twitterbot
PaperLiBot
Add Catalog
1PasswordThumbs
MJ12bot
SmartLinksAddon
YahooCacheSystem
TweetmemeBot
CJNetworkQuality
YandexImages
StatusNet
Untiny
Feedfetcher-Google
DCPbot
AppEngine-Google
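For reference, enabling middleware like this in the bundling .psgi file usually goes through Plack::Builder. This is an illustrative sketch, not the original file: the application class name and the way it produces its PSGI coderef are my assumptions.

```perl
# app.psgi -- illustrative sketch; MyApp and psgi_app() are assumptions
use Plack::Builder;
use MyApp;

my $app = MyApp->psgi_app;

builder
{
    # enable() wraps the application in the named middleware; the leading
    # '+' tells Plack to use the class name verbatim rather than prefixing
    # it with Plack::Middleware::
    enable '+MyApp::Plack::Middleware::BotDetector';
    $app;
};
```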
Plack middleware wraps around the application to examine and possibly modify the incoming request, to call the application (or the next piece of middleware), and to examine and possibly modify the outgoing response. Plack conforms to the PSGI specification to make this possible.
Update: This middleware is now available as Plack::Middleware::BotDetector from the CPAN. Thanks to Big Blue Marble and Trendshare for sponsoring its development and release.
All of that means that any piece of middleware gets activated by something which calls its call() method, passing in the incoming request as the first parameter. This request is a hash with specified keys. The application, or at least the next piece of middleware to call, is available from the object's accessor method app().
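If the Plack::Middleware class feels magical, remember that a PSGI application is just a code reference taking $env and returning a response triple, and middleware is just a function that wraps one code reference in another. Here's a dependency-free sketch of the same pattern; the function names and the single hard-coded bot pattern are mine, not from the post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A PSGI application: takes the environment hash, returns a response triple.
my $app = sub
{
    my $env  = shift;
    my $body = $env->{'BotDetector.looks-like-bot'} ? 'bot' : 'human';
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ $body ] ];
};

# Middleware: takes an app coderef and returns a new app which examines
# the request, annotates $env, then delegates to the wrapped app.
sub wrap_with_bot_detector
{
    my $inner = shift;

    return sub
    {
        my $env = shift;
        my $ua  = $env->{HTTP_USER_AGENT} || '';
        $env->{'BotDetector.looks-like-bot'} = 1 if $ua =~ /Googlebot/;
        return $inner->( $env );
    };
}

my $wrapped  = wrap_with_bot_detector( $app );
my $response = $wrapped->( { HTTP_USER_AGENT => 'Googlebot/2.1' } );
print $response->[2][0], "\n";    # prints "bot"
```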
(I'm lazy. I use Plack::Request to turn $env into an object. This is not necessary.)
The rest of the code is really simple. I have a list of unique segments of the user agent strings I've seen in this application. I use Regexp::Assemble to turn these words into a single (efficient) regex. If the incoming request's user agent string matches anything in the regex, I add a new entry to the environment hash.
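Regexp::Assemble isn't in the Perl core, but the idea is easy to demonstrate with a plain alternation. Regexp::Assemble builds a trie-optimized regex from the same input, but it matches the same strings as this naive version (the three-entry word list is a small assumed sample):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few user agent fragments, as in the middleware's __DATA__ section.
my @bots = qw( Googlebot bingbot YandexBot );

# Join the quoted fragments into one alternation with word boundaries;
# Regexp::Assemble produces a more efficient equivalent of this regex.
my $alternation = join '|', map { '\b' . quotemeta( $_ ) . '\b' } @bots;
my $bot_regex   = qr/$alternation/;

for my $ua ('Mozilla/5.0 (compatible; Googlebot/2.1)', 'Mozilla/5.0 (X11; Linux)')
{
    printf "%s => %s\n", $ua, ( $ua =~ $bot_regex ? 'bot' : 'not a bot' );
}
```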
With that in place, any other piece of middleware executed after this point in the request—or the application itself—can examine the environment and choose different behavior based on the bot-looking-ness of any request. My cohort event logger method looks like:
=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event) = @_;
    return if $self->request->env->{'BotDetector.looks-like-bot'};

    $event{usertoken} ||= $self->sessionid || 'unknownuser';
    push @{ $self->cohort_events }, \%event;
}
The bolded line is all it took in my application to stop logging cohort events for spiders. If and when I see a new spider in the logs, I can exclude it by adding a line to the middleware's DATA section and restarting the server.
(You might rather store this information in a database, but I'd rather build the regex once than loop through a database with a LIKE query. I haven't found an ideal alternate solution, which is why I haven't put this on the CPAN. Perhaps this is two modules, one for the middleware and one which exports a regex to identify spider user agents.)
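One way that two-module split could look: a small module that builds the regex once at load time and exposes it, which both the middleware and any offline log analyzer could share. The module name, variable name, and word list below are all hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: a module whose only job is to expose a bot regex.
package Bot::UserAgent::Regex;

my @FRAGMENTS = qw( Googlebot bingbot YandexBot Baiduspider );

# Built once at load time -- no per-request assembly, no database lookups.
our $BOT_REGEX = do
{
    my $alt = join '|', map { '\b' . quotemeta( $_ ) . '\b' } @FRAGMENTS;
    qr/$alt/;
};

package main;

# The middleware (or a log-analysis script) just matches against it.
print 'Baiduspider/2.0' =~ $Bot::UserAgent::Regex::BOT_REGEX
    ? "bot\n"
    : "not a bot\n";    # prints "bot"
```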
There's one more trick to this cohort event logging: traceability. That's the topic for next time.
My takeaway is that there is probably a need for an "Acme::RobotsRegex" module on CPAN? Perhaps better named.
I am also sitting here thinking about how interesting it would be to do monitoring functions via Plack plugins, like latency monitoring or perhaps cross-site scripting detection. Functions like those in "Introscope", but without involving CA.
There should be a space between "package" and "MyApp" on line 1 of the package source. :)
Nice use of Regexp::Assemble and a __DATA__ section! Although I have yet to play with Plack, I like this idea of using middleware for some intermediary logic.
I like the idea of a middleware module for this. But wouldn't it be better to use (and perhaps update) one of the existing modules for identifying bots?
Neil Bowers wrote this nice article: http://neilb.org/reviews/user-agent.html
I didn't do this for two reasons.
First, I didn't find these modules (or the review) on my initial search of the CPAN. Second, Neil's conclusions show that the tradeoff between accuracy and speed is awful for the cases he saw. What I have now is definitely not optimal from a code and reuse standpoint, but it doesn't slow down every request dramatically.
I tested your code against more than 10000 user agent strings (from http://useragentstring.com/pages/All/). On my laptop it took about a quarter of a second to filter out 80 bot strings:
Not bad at all, thank you!
This sounds way too complex and heavy.
We need a lightweight way to detect NEW bots from their behavior.
If it has to run on every page load, it would need to be EXTREMELY lightweight, i.e. no heavier than hitting one or two small files, no MySQL even, so as not to add significant latency to page loads, and so it can quickly detect and block some of those really nasty ones with insane request rates!
So the detection methods will need to cope with pretty hardcore request rates long enough to detect and block!
Michael, you could try including three hidden links somewhere on your homepage a la:
An IP address that hits more than one of those links is flagged as a suspected bot.
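A minimal sketch of the flagging logic the commenter describes, assuming you record which IP requested which honeypot URL somewhere. The threshold ("more than one distinct link"), the function names, and the in-memory hash are all my assumptions; a real deployment would need shared storage across workers:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Track which honeypot links each IP has requested.
my %hits;

sub record_honeypot_hit
{
    my ($ip, $link) = @_;
    $hits{$ip}{$link} = 1;
}

# An IP that hits more than one hidden link is a suspected bot.
sub looks_like_bot
{
    my $ip = shift;
    return keys %{ $hits{$ip} || {} } > 1;
}

record_honeypot_hit( '203.0.113.7',  '/trap-1' );
record_honeypot_hit( '203.0.113.7',  '/trap-2' );
record_honeypot_hit( '198.51.100.4', '/trap-1' );

print looks_like_bot('203.0.113.7')  ? "bot\n" : "ok\n";    # prints "bot"
print looks_like_bot('198.51.100.4') ? "bot\n" : "ok\n";    # prints "ok"
```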