When you're collecting data from the wild and wooly Internet at large, you never know exactly what you're going to find.
This article is the third in a series which started with Annotating User Events for Cohort Analysis and continued in Detecting Bots and Spiders with Plack Middleware.
By the end of the previous article, I had a pretty good solution to the problem of detecting spiders and bots I'd seen before. I have confidence in the test suite I wrote to prove that my event logger ignores those user agents, so that any events they trigger in the web application do not get logged.
What I don't know is the identifying information of bots and spiders I haven't seen yet.
Any well-behaved bot or spider should request robots.txt before beginning to crawl the site. Not all do. Worse, not all requests for that file come from bots or spiders; I've looked at that file on various servers in a normal web browser myself.
I pondered the idea of logging requests for favicon.ico as evidence that a real user agent made a request, but that feels unreliable too.
If you saw either of my talks this summer (When Wrong is Better at YAPC::NA 2012 and How and When to Do It Wrong at Open Source Bridge 2012), you may have heard me say that I try to design my software so that it's idempotent. If you run it a second time on the same dataset, you should get the same results. There's no shame in running it a second time. Nothing goes wrong.
(I go further: it should be quick to run again, and if you've improved the data, you should get better results. This should be obvious, but too much software destroys the incoming data when the transform-filter-report approach is safer.)
Analysing the event log for patterns and reports is relatively fast. Right now, re-running the past week or two of reports is quick enough to be practical. (It could be faster, but it's fine for now.) I don't mind a few bots getting through and having their events logged because I can remove them and regenerate the reports.
I can identify them because I have a notes field in the event log database table:
CREATE TABLE cohort_log
(
    id         INTEGER      PRIMARY KEY AUTOINCREMENT,
    usertoken  VARCHAR(255) NOT NULL,
    day        INTEGER      NOT NULL,
    month      INTEGER      NOT NULL,
    year       INTEGER      NOT NULL,
    event      TEXT(25)     NOT NULL,
    notes      VARCHAR(255) DEFAULT ''
);
When the application processes a request and that request triggers the logging of an event, the log_cohort_event() method gets called.
The previous article showed an earlier version of this code. The deployed code, as of this writing, is actually:
=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event)  = @_;
    my $env             = $self->request->env;

    # the bot-detecting middleware has already flagged this request
    return if $env->{'BotDetector.looks-like-bot'};

    # first event of a new session: keep the user agent for later review
    unless ($self->sessionid)
    {
        $self->create_session_id;
        $event{notes} = $env->{HTTP_USER_AGENT};
    }

    $event{usertoken} ||= $self->sessionid;
    push @{ $self->cohort_events }, \%event;
}
You've seen half of this code before. If the custom Plack middleware has detected that the request came from a spider or bot, it has set the appropriate flag in the PSGI environment hash.
If the middleware hasn't set that flag, the request may have come from a human.
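To make the flag check concrete, here is a minimal sketch of how a PSGI wrapper could set that flag. It is not the middleware from the previous article: it uses a plain closure instead of a Plack::Middleware subclass, and the pattern list and the wrap_with_bot_detector() name are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pattern list; the real middleware's list is not shown here.
my @BOT_PATTERNS = (qr/Googlebot/i, qr/bingbot/i, qr/\bspider\b/i);

# Wrap a PSGI application so that requests from known bots carry a flag
# in the environment hash before the application sees them.
sub wrap_with_bot_detector
{
    my ($app) = @_;

    return sub
    {
        my ($env) = @_;
        my $ua    = $env->{HTTP_USER_AGENT} // '';

        $env->{'BotDetector.looks-like-bot'} = 1
            if grep { $ua =~ $_ } @BOT_PATTERNS;

        return $app->($env);
    };
}

# A trivial PSGI app that reports whether it would log an event.
my $app = wrap_with_bot_detector(sub
{
    my ($env) = @_;
    my $body  = $env->{'BotDetector.looks-like-bot'} ? 'skipped' : 'logged';
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ $body ] ];
});

my $bot_response   = $app->({ HTTP_USER_AGENT => 'Mozilla/5.0 (compatible; Googlebot/2.1)' });
my $human_response = $app->({ HTTP_USER_AGENT => 'Mozilla/5.0 (X11; Linux x86_64) Firefox/15.0' });

print "$bot_response->[2][0]\n";
print "$human_response->[2][0]\n";
```

Because the flag lives in the environment hash, anything downstream (such as log_cohort_event()) only has to test one key and never needs to know how the detection works.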
If the request has no active session—whether it's the first time a user has appeared on the site or whether the user refuses session tracking—the system needs a session id. This token is part of the cookie sent to users and the index into the server-side storage of session data. It's also the unique identifier for individuals within the event log. (I could track something like a user id, but users who don't have accounts on the system won't have those, and it's far too easy to correlate user ids with real people, and all of the owners of the company have agreed that we'll only ever do data mining on anonymized data. If it's too easy to correlate user actions with session ids—if someone ever finds their way into the server-side session storage in the small window when there's any identifying information in there—I'll hash the session ids when I use them as tokens. It hasn't been a problem yet, and I don't foresee it as a problem.)
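If hashing the session ids ever becomes necessary, one possible approach is a keyed hash, so that the same session always maps to the same token but the token cannot be turned back into a session id without the key. This is only a sketch: the token_for_session() name and the secret are hypothetical, and Digest::SHA (a core module) is just one reasonable choice.

```perl
use strict;
use warnings;
use Digest::SHA 'hmac_sha256_hex';

# Hypothetical secret; in practice it would live outside the source tree.
my $TOKEN_SECRET = 'replace-with-a-real-secret';

# Derive a stable, opaque token from a session id.
sub token_for_session
{
    my ($sessionid) = @_;
    return hmac_sha256_hex($sessionid, $TOKEN_SECRET);
}

my $token = token_for_session('abc123');
print length($token), "\n";   # 64 hex characters
```

The token stays stable across requests, so events from one session still group together in the log.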
Oh, there's one more reason why a request without a session id might exist: it might come from a bot or spider. In my experience, most automated processes ignore session cookies.
In any case, at the point of identifying a request which looks like it may have come from a real user, the system adds the request's user agent to the event's notes field.
I haven't yet written the code to grab unique user agents out of this field every day, but that's trivial. (It's just one more report, and you can probably write the SQL for it in your head, to say nothing of the DBIC code.) If that report orders by the number of occurrences of each user agent, it's almost trivial to pick out more user agents that look like likely bots. Then we can do two things: filter those requests out of the event logs and re-run the reports, and add those user agent strings to the Plack middleware that detects bots.
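As a rough illustration of that report, the sketch below counts user agents over rows that have already been fetched from cohort_log. The user_agent_report() name and the sample rows are invented for the example; a real version would more likely run the query in the database.

```perl
use strict;
use warnings;

# Count occurrences of each non-empty notes value, most frequent first.
# Roughly equivalent SQL, assuming the cohort_log table shown earlier:
#   SELECT notes, COUNT(*) AS hits FROM cohort_log
#   WHERE notes <> '' GROUP BY notes ORDER BY hits DESC;
sub user_agent_report
{
    my (@rows) = @_;
    my %hits;

    for my $row (@rows)
    {
        my $ua = $row->{notes} // '';
        next unless length $ua;
        $hits{$ua}++;
    }

    # Ties are broken alphabetically so the output order is deterministic.
    return map  { [ $_, $hits{$_} ] }
           sort { $hits{$b} <=> $hits{$a} || $a cmp $b }
           keys %hits;
}

# Invented sample rows for illustration only.
my @rows = (
    { usertoken => 't1', event => 'homepage', notes => 'ExampleCrawler/1.0' },
    { usertoken => 't1', event => 'search',   notes => '' },
    { usertoken => 't2', event => 'homepage', notes => 'ExampleCrawler/1.0' },
    { usertoken => 't3', event => 'homepage', notes => 'Mozilla/5.0 Firefox/15.0' },
);

printf "%5d  %s\n", $_->[1], $_->[0] for user_agent_report(@rows);
```

A client that ignores session cookies starts a new session on every request, so its user agent shows up in the notes field again and again and floats to the top of the list.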
The process isn't completely automated, but it's automated enough that only a little bit of human interaction can polish the system such that it gets better every day. We can't prevent exceptional or undesired events from happening, but we can identify them, then remove them from the system.
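The removal step could look something like the following sketch, which assumes the events are already in memory as hash references. The filter_bot_events() name, the pattern, and the sample events are hypothetical. It returns a new list instead of altering the input, so running it a second time on its own output changes nothing.

```perl
use strict;
use warnings;

# Return the events that do not belong to sessions started by a suspected bot.
# Only the event that created a session carries the user agent in notes, so
# collect the offending usertokens first, then drop every event sharing one.
sub filter_bot_events
{
    my ($events, @bot_patterns) = @_;
    my %bot_token;

    for my $event (@$events)
    {
        my $ua = $event->{notes} // '';
        next unless length $ua;
        $bot_token{ $event->{usertoken} } = 1
            if grep { $ua =~ $_ } @bot_patterns;
    }

    return [ grep { !$bot_token{ $_->{usertoken} } } @$events ];
}

# Invented sample events for illustration only.
my $events = [
    { usertoken => 't1', event => 'homepage', notes => 'ExampleCrawler/1.0' },
    { usertoken => 't1', event => 'search',   notes => '' },
    { usertoken => 't2', event => 'homepage', notes => 'Mozilla/5.0 Firefox/15.0' },
    { usertoken => 't2', event => 'signup',   notes => '' },
];

my @patterns = (qr/ExampleCrawler/);
my $once     = filter_bot_events($events, @patterns);
my $twice    = filter_bot_events($once,   @patterns);

print scalar(@$once), " events kept; second pass keeps ", scalar(@$twice), "\n";
```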
In the absence of a perfect system with perfect knowledge, I'll take a robust system we know how to improve.

