I’ve written several articles now about the trials and tribulations of character encoding in Perl. Having gained the knowledge I have, I’ve also been finding bugs in libraries we use at $work and sending patches to their maintainers.
The latest one is DBD::SQLite, CPAN’s self-contained SQLite binding. It’s a great library that I’ve used for years, but I recently noted two problems in it:
1) In its default configuration it used the SvPV macro to translate Perl strings to C strings, which is bad for reasons I detailed in “Perl’s SvPV Menace”.
2) In its (non-default) “unicode” configuration it used a “naïve” method of UTF-8 decoding that neglects validation. This mechanism can corrupt Perl’s internals by making it mistake invalid UTF-8 sequences for valid ones.
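The difference between naïve and validating decoding can be sketched in pure Perl. This is only an approximation of what happens at the C level: Encode::_utf8_on() stands in for switching on the UTF-8 flag without checking the bytes, and utf8::decode() stands in for a decoder that validates first. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Latin-1 bytes for "café": 0xE9 opens a 3-byte UTF-8 sequence that never
# completes, so this buffer is NOT valid UTF-8.
my $latin1 = "caf\xe9";

# Naive approach: switch the UTF-8 flag on without looking at the bytes.
# Perl now treats the buffer as UTF-8 even though it is malformed.
my $naive = $latin1;
Encode::_utf8_on($naive);
my $naive_flagged    = utf8::is_utf8($naive) ? 1 : 0;   # flag is on
my $naive_consistent = utf8::valid($naive)   ? 1 : 0;   # but the buffer is malformed

# Validating approach: utf8::decode() checks the bytes first. On malformed
# input it returns false and leaves the string untouched.
my $checked    = $latin1;
my $decoded_ok = utf8::decode($checked) ? 1 : 0;

# Well-formed input decodes normally: 5 bytes become 4 characters.
my $good    = "caf\xc3\xa9";
my $good_ok = utf8::decode($good) ? 1 : 0;
```

The validating path costs a scan of every incoming string, which is where the performance hit comes from.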
Neither of these is trivial to fix: applications may depend on the SvPV problem (what one coworker of mine calls a “load-bearing bug” 😀), while adding UTF-8 validation entails a performance hit.
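To see how an application could come to depend on the SvPV behaviour, here is a rough pure-Perl approximation. SvPV and SvPVbyte are C macros, so the helpers internal_bytes() and as_byte_string() below are hypothetical stand-ins written only for this illustration: the first mimics handing C the string's internal buffer regardless of the UTF-8 flag, the second mimics downgrading to bytes first.

```perl
use strict;
use warnings;

# Two strings that Perl itself considers equal...
my $downgraded = "caf\xe9";   # stored internally as one byte per character
my $upgraded   = "caf\xe9";
utf8::upgrade($upgraded);     # same characters, stored internally as UTF-8

my $equal_in_perl = ($downgraded eq $upgraded) ? 1 : 0;

# Hypothetical stand-in for SvPV: return the internal buffer as-is.
# For a UTF-8-flagged string that buffer is the UTF-8 encoding.
sub internal_bytes {
    my ($str) = @_;                              # works on a copy
    utf8::encode($str) if utf8::is_utf8($str);
    return $str;
}

# Hypothetical stand-in for SvPVbyte: downgrade first, so equal strings
# always yield equal bytes. Dies on characters above 0xFF.
sub as_byte_string {
    my ($str) = @_;
    utf8::downgrade($str);
    return $str;
}

my $pv_a   = internal_bytes($downgraded);   # 4 bytes
my $pv_b   = internal_bytes($upgraded);     # 5 bytes
my $byte_a = as_byte_string($downgraded);   # 4 bytes
my $byte_b = as_byte_string($upgraded);     # 4 bytes
```

An application that happens to store only upgraded strings has been writing UTF-8 all along, and would see its data change if the default switched to the byte-string behaviour.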
In reality, DBD::SQLite needed at least 4 modes of translating between Perl and C strings:
1) The current (“load-bearing-buggy”) default.
2) Same as #1, but use SvPVbyte to avoid the SvPV bug.
3) Current “naïve unicode” behaviour.
4) A “non-naïve unicode” mode that validates incoming UTF-8.
(I eventually made two variants of this last one: one that just warns on invalid data, and the other that throws an exception.)
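As a sketch of how those modes differ for data coming back from SQLite, here is a pure-Perl dispatcher. The mode names and the decode_from_c() function are invented for this example and are not DBD::SQLite's actual interface; the real logic lives in C. The two non-unicode modes differ only in the Perl-to-C direction, so they behave identically here.

```perl
use strict;
use warnings;
use Carp ();
use Encode ();

# Hypothetical mode names, for illustration only.
sub decode_from_c {
    my ($bytes, $mode) = @_;

    # Modes 1 and 2: hand the bytes back undecoded.
    return $bytes if $mode eq 'pv' || $mode eq 'bytes';

    # Mode 3: trust the bytes and just switch on the UTF-8 flag.
    if ($mode eq 'unicode_naive') {
        Encode::_utf8_on($bytes);
        return $bytes;
    }

    # Mode 4, both variants: validate while decoding.
    if ($mode eq 'unicode_fallback' || $mode eq 'unicode_strict') {
        my $copy  = $bytes;   # decode() with a CHECK value may modify its input
        my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
        return $chars if defined $chars;

        Carp::croak("Invalid UTF-8 received") if $mode eq 'unicode_strict';
        Carp::carp("Invalid UTF-8 received; returning raw bytes");
        return $bytes;
    }

    Carp::croak("Unknown string mode '$mode'");
}
```

With a string-valued mode like this, adding a sixth behaviour later is one more branch rather than an interface change.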
There was another problem, though: DBD::SQLite’s interface for controlling this was a boolean. That meant only two modes were even possible!
This exemplifies a principle a mentor of mine taught me years back: avoid boolean parameters. They restrict your ability to add additional configurations.
(And for pity’s sake, abhor unnamed booleans in particular! What does the 0 in open_file($path, 0) mean??)
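For contrast, here is one way a named, non-boolean parameter might look. This open_file() is purely hypothetical: its option name and the set of accepted values are assumptions made for the example.

```perl
use strict;
use warnings;
use Carp ();

# Hypothetical: the caller names the behaviour instead of passing a bare flag.
my %OPEN_MODE = (
    read   => '<',
    write  => '>',
    append => '>>',
);

sub open_file {
    my ($path, %opts) = @_;
    my $mode  = $opts{mode} // 'read';
    my $layer = $OPEN_MODE{$mode} // Carp::croak("Unknown mode '$mode'");
    open my $fh, $layer, $path or Carp::croak("Cannot open $path: $!");
    return $fh;
}
```

A call such as open_file($path, mode => 'append') documents itself, and supporting a new mode means adding one entry to the table.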
To fix this my pull request had to deprecate the existing sqlite_unicode parameter. It’s an unfortunate step that’ll produce new warnings in existing applications, but the “omelet” here justifies the “broken egg”.
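A deprecation shim of this general shape might look like the following. Only sqlite_unicode comes from the real module; the string_mode attribute, its values, and resolve_string_mode() are names assumed for this sketch, which is not the code from the pull request.

```perl
use strict;
use warnings;
use Carp ();

# Hypothetical: pick the effective mode from connection attributes,
# honouring the legacy boolean but warning about it.
sub resolve_string_mode {
    my (%attr) = @_;

    # The new, named setting always wins.
    return $attr{string_mode} if defined $attr{string_mode};

    if (exists $attr{sqlite_unicode}) {
        Carp::carp("sqlite_unicode is deprecated; set string_mode instead");
        # Map each legacy value to the behaviour it always had.
        return $attr{sqlite_unicode} ? 'unicode_naive' : 'pv';
    }

    return 'pv';   # unchanged default
}
```

Existing applications keep their current behaviour and only gain a warning, while new code can opt in to any of the other modes by name.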