The Hasty Generalization of Language Features


I suspect, but cannot yet prove, that one of the reasons for dissatisfaction with modern programming languages, as well as one of the reasons behind the call to break backwards compatibility, is that it's difficult to predict what people will use a language for and, as such, it's nigh unto impossible to get the language's API right the first time. Reality is that which, even if your programming language will not admit it, never goes away.

(I own an otherwise good Haskell book which uses a custom mathematical notation for Haskell operators. You cannot actually type this notation and have Haskell accept your code. Instead you must flip to the back of the book for a translation table between the author's preferred typographic notation and what the Haskell language actually supports. Also the book has typos.)

Because of the mathematical foundations of programming, there's a long-standing trend to reduce any programming language to a simple and consistent and irreducible set of independent axioms. You sometimes hear of this as the theoretical axis of programming and sometimes "the formal core of a language". In theory—the land of programming language research, where the practical use of a language is less important than the degree to which a language explores a new or interesting design principle—that formal core is all important. It allows you to reason about a thing.

In practice, most programmers want to accomplish a task without having to digest sizable chunks of the Principia Mathematica. Then again, they also want to learn a few distinct ideas about the language's underlying philosophy (though only by osmosis, which is the subject of a different article) so that they can reuse those ideas in other parts of their programming.

In other words, most people approach programming from a similar point of view, though they arrive at it from different directions. The theorists want to start from a small set of axioms and reason outward, while the practitioners want to learn only a little bit and gradually absorb the rest as they need it. (Okay, most of them don't want to absorb anything, but they do anyway, and I'm happy to evaluate their wants based on their behaviors and not what they say.)

The result sometimes works. Other times, it leads to hasty generalizations.

Consider Perl's taint mode.

With taint mode enabled, all external data has a taint associated with it. If you use tainted data in an insecure way, Perl will complain. Before you can use this data safely, you must untaint it. (Who said Perl doesn't have a type system?)
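
Here's a minimal sketch of the feature in action; run it under perl -T. The greeting is invented for illustration, and the environment scrubbing comes straight from perlsec's advice:

    #!/usr/bin/perl -T
    use strict;
    use warnings;

    # give child processes a known-safe environment, per perlsec
    $ENV{PATH} = '/bin:/usr/bin';
    delete @ENV{qw( IFS CDPATH ENV BASH_ENV )};

    die "usage: $0 <name>\n" unless @ARGV;
    my $name = $ARGV[0];    # command-line arguments arrive tainted

    # using tainted data to affect the world outside the program is fatal:
    # "Insecure dependency in system while running with -T switch"
    system "echo Hello, $name";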

So far so good.

How do you untaint data? You extract part of it with a regular expression capture.
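
To continue the sketch, the only blessed path from tainted to trusted is a successful capture:

    # a successful capture untaints the captured portion
    my ($clean_name) = $name =~ /\A([A-Za-z ]+)\z/
        or die "Suspicious characters in name\n";

    system "echo Hello, $clean_name";    # Perl now trusts this value

Note that the pattern does double duty: it validates the data and it signals Perl to untaint the capture.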

Back up from the practical implementation for a moment and consider the theoretical axis of the language:

  • We have data marked as tainted
  • We must untaint that data before using it in secure operations
  • Untainting that data implies validating it somehow
  • We can use regular expressions to assert properties about data
  • Therefore, the right way to untaint data is to apply a regular expression and capture a subset of that data

You can see the reasoning, but you can also see the leap of logic in the fourth point. Yes, you can use regular expressions to validate data, but regexes are neither the only way to establish data as trustworthy, nor is validation the only purpose of a capture group.

For example, a web application might provide a URI like /studies/010874/review, where 010874 is the primary key of a study table. Before using a client-supplied value in a database query, you might rightly want to validate that the key is safe.

A very simple untainting might check that the number is composed solely of digits. That might not be sufficient. (If you're using a form processing module, it might have already done this for you if you specified the input parameter as a positive integer.)
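
Assuming the raw parameter arrived in $id (the variable name is mine), that check is short:

    # untaint the study key by permitting only digits
    my ($study_id) = $id =~ /\A(\d+)\z/
        or die "Invalid study id\n";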

A valid id must match an actual database record. It might also need to match a record where another column (active, for example) has a true value.
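
Here's a sketch of that check with DBI, assuming a connected handle in $dbh and hypothetical table and column names:

    # the real validation: does an active study with this id exist?
    my ($exists) = $dbh->selectrow_array(
        'SELECT 1 FROM studies WHERE id = ? AND active = 1',
        undef, $study_id,
    );

    die "No such active study\n" unless $exists;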

You cannot (easily or sensibly; I know about executing arbitrary code in a regex, and if you call out to a database from there, you'd better be showing off and not serious) encode this kind of validity checking into a regex. There's simply not enough mechanism there to express the necessary intent.

... or the aforementioned form processing module might have performed a very simple regex and untainted the value for you already without intending to.

Likewise, a date might match a date-handling regex even if that date is in the future and, in your application, invalid.
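
For example, with the core Time::Piece module (the no-future-dates rule is an invented application requirement):

    use Time::Piece;

    my $input = defined $ARGV[0] ? $ARGV[0] : '';   # tainted, as before

    # the capture untaints the date, but syntax is not semantics
    my ($date) = $input =~ /\A(\d{4}-\d{2}-\d{2})\z/
        or die "Malformed date\n";

    my $when = Time::Piece->strptime( $date, '%Y-%m-%d' );
    die "Date is in the future\n" if $when > localtime;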

The PerlMonks thread Taint Mode Limitations makes this point as well. A separate untaint builtin—and no implicit untainting from capture groups—would allow programmers to express their intent more clearly. In the case of writing secure code, the clarity of intent seems much more valuable than the desire to reuse an existing feature and save a new keyword.

Update: I had forgotten about Taint::Util, which does the right thing for the minor cost of installing a CPAN module.
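
With Taint::Util, the code finally says what it means (study_exists() here is a hypothetical validation function):

    use Taint::Util;

    # validate however the application demands, then declare the value trusted
    if ( $id =~ /\A\d+\z/ and study_exists( $id ) ) {
        untaint( $id );
    }

The boolean match validates without untainting anything; only the explicit untaint() changes the value's status, which is exactly the separation of intent a dedicated builtin would provide.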

1 Comment

As is clearly stated in http://www.perlmonks.org/?node_id=1002217, the original purpose of taint mode was to protect system calls. So there is another hasty generalization: using taint mode to protect data from being printed.


