"Hmm," I found myself thinking the other day. "I've found and fixed quite a few potential bugs in this client application related to Unicode. Allison and I just went through and normalized user input so as to avoid casefolding errors. I wonder what happens if I try to register with a UTF-8 password."
Like all applications with a decent security policy, this application immediately hashes user passwords (it uses SHA-1 hashing instead of Bcrypt, but one thing at a time). When it creates a new user record, it uses Perl's Digest::SHA to hash the password before storing it in the database. When a user attempts to log in, the application performs a database query to look up the provided email address and the password, with SQL something like:
SELECT person_id FROM person WHERE primary_email = ? AND passphrase = sha1(?);
The assumption seemed reasonable; because SHA-1 is an algorithm with its details widely published and implemented, both PostgreSQL and Perl should provide the same hash, given the same input.
I took Tom's example from the Perl Unicode Cookbook's casefolding recipe (because I felt like this work was the data equivalent of rolling a boulder up a hill) and added a case to our registration tests with a password of Σίσυφος.
Boom.
Digest::SHA1 croaked, complaining about wide characters.
I looked over the code again. I'd enabled UTF-8 literals. I'd saved the file with the proper encoding. We'd fixed the encoding of input and we were normalizing all input to the NFC form. Everything looked right.
Then, buried in the documentation of Digest::MD5, I found a reference that suggested that that module explicitly does not handle wide characters—that it only works on strings of 8-bit characters. Anything outside of Latin-1 is just out.
The documentation suggested explicitly encoding the Unicode string to UTF-8 octets, then performing the digest...
... but when I did that, Perl and PostgreSQL disagreed about the resulting hash.
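To make the failure concrete, here is a minimal sketch, not the application's actual code. It assumes a reasonably recent Digest::SHA (which croaks on code points above 255) and the core Encode module. The UTF-16BE line is purely illustrative: it shows that the same text yields a different digest under a different encoding, and makes no claim about which octets PostgreSQL or DBD::Pg actually hashed.

```perl
use strict;
use warnings;
use Encode      qw(encode);
use Digest::SHA qw(sha1_hex);

# "Σίσυφος" written with escapes so the example does not depend on
# how this file is saved. This is a character string, not octets.
my $password = "\x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}";

# Hashing the character string directly croaks, because the Greek
# letters have code points above 255.
my $ok = eval { sha1_hex($password); 1 };
print "direct hash failed: $@" unless $ok;

# Choosing an encoding explicitly turns the characters into octets.
my $utf8_hash  = sha1_hex( encode( 'UTF-8',    $password ) );
my $utf16_hash = sha1_hex( encode( 'UTF-16BE', $password ) );

# Same text, different octets, different digests.
print "UTF-8:    $utf8_hash\n";
print "UTF-16BE: $utf16_hash\n";
```

If two layers disagree about a hash, comparing the octets each one fed to the algorithm is usually a quicker diagnostic than comparing the digests themselves.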
The super nice thing about standards is what they don't mention about the assumptions they make, and how they leave those assumptions up to implementations, and how when people try to do the right thing and run right up against those assumptions, sometimes they find out the difficult way that competing implementations have chosen very different approaches.
I spent the rest of the afternoon chasing down every place in the source code which hashed passwords in the Perl layer and changed them all to hash passwords in the database layer. All tests passed.
This bothers me for two reasons. First, I don't know which of Digest::SHA or PostgreSQL is doing the right thing, because I don't know what the right thing is. I can make a case for both behaviors, depending on whether I care more about doing what the user intends or being strict about the data at the interface. I've argued it both ways explaining it to co-workers.
Second, I went to all of this work to prevent bugs from occurring and to do the right thing for people who'll probably never notice that our code does the right thing—and I'm sure almost every website I've ever used in my life gets this wrong, including (especially?) banks.
That's only slightly horrifying.
While Digest::SHA1 doesn't mention Unicode, the POD for Digest::SHA does.
What it says matches my understanding: that hash algorithms operate on bits, which are easy to get from bytes. But if you have text, then you need to encode it as bytes before you hash. The hashing should be treated like another output.
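A minimal sketch of treating the hash as another output, assuming NFC normalization and UTF-8 as the agreed encoding. The helper name hash_passphrase is hypothetical and not part of any module, and SHA-1 is kept only to match the post.

```perl
use strict;
use warnings;
use Encode             qw(encode);
use Unicode::Normalize qw(NFC);
use Digest::SHA        qw(sha1_hex);

# hash_passphrase() is a hypothetical helper: normalize the text,
# encode it to octets, and only then hand it to the digest.
sub hash_passphrase {
    my ($text) = @_;
    my $octets = encode( 'UTF-8', NFC($text) );
    return sha1_hex($octets);
}

# The same word spelled two ways: with a precomposed iota-with-tonos
# (U+03AF) and with a plain iota followed by a combining acute (U+0301).
my $composed   = "\x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}";
my $decomposed = "\x{3A3}\x{3B9}\x{301}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2}";

my $hash_a = hash_passphrase($composed);
my $hash_b = hash_passphrase($decomposed);

# Normalizing first means both spellings produce the same digest.
print $hash_a eq $hash_b ? "same digest\n" : "different digests\n";
```

The resulting hex string is plain ASCII, so it can be handed to the database as an ordinary bind parameter.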
I'm wondering why you opted for doing the hashing in the database. I would have opted for the other way, if only to minimise the layers that have access to the raw password.
On further thought I would strongly disagree with hashing in the database because your application now relies on an assumption that you don't actually know. What if that assumption changes with database versions? Or maybe the assumption is actually being made inside DBI or DBD::Pg?
A final comment is that at my work we had some bugs caused by differences in date manipulations between the database and Perl. So it always pays to be extra careful when you have two different layers attempting to perform the same action.