SQLite, Perl, and a Boolean

Felipe Gasper - May 30 '21 - Dev Community

I’ve written several articles now about the trials and tribulations of character encoding in Perl. Armed with what I learned along the way, I’ve also been finding bugs in libraries we use at $work and sending patches to their maintainers.

The latest one is DBD::SQLite, CPAN’s self-contained SQLite binding. It’s a great library that I’ve used for years, but I recently noted two problems in it:

1) In its default configuration it used the SvPV macro to translate Perl strings to C strings, which is bad for reasons I detailed in “Perl’s SvPV Menace”.

2) In its (non-default) “unicode” configuration it used a “naïve” method of UTF-8 decoding that neglects validation. This mechanism can corrupt Perl’s internals by making Perl treat invalid UTF-8 sequences as if they were valid.
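Both problems can be approximated in pure Perl. The sketch below is only an analogy, not DBD::SQLite’s XS code: it assumes that counting bytes under `use bytes` is a fair stand-in for what SvPV exposes to C, and that `Encode::_utf8_on` (an internals-poking function in Encode) is a fair stand-in for the “naïve” decoding.

```perl
use strict;
use warnings;
use Encode ();

# Problem 1: two strings that Perl considers equal can be stored differently.
my $downgraded = "\xe9";            # "é", stored internally as 1 byte
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);           # same string, now stored internally as 2 bytes

my $equal      = ($downgraded eq $upgraded) ? 1 : 0;
my $bytes_down = do { use bytes; length $downgraded };
my $bytes_up   = do { use bytes; length $upgraded };
print "equal=$equal, internal bytes: $bytes_down vs. $bytes_up\n";

# Problem 2: "naive" decoding sets the UTF-8 flag without checking the bytes.
my $invalid = "\xff";               # 0xFF can never appear in valid UTF-8
my $naive   = $invalid;
Encode::_utf8_on($naive);           # flag is set; nothing was validated
my $flagged = utf8::is_utf8($naive) ? 1 : 0;

# A validating decode refuses the same input.
my $strict_ok = eval {
    Encode::decode('UTF-8', $invalid, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 1 : 0;
print "naive flagged=$flagged, strict accepted=$strict_ok\n";
```

Code that reads the internal buffer directly, as SvPV does, would see different bytes for `$downgraded` and `$upgraded` even though Perl treats them as the same string.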

Neither of these is trivial to fix: applications may depend on the SvPV problem—what one coworker of mine calls a “load-bearing bug” 😀—while adding UTF-8 validation entails a performance hit.

In reality, DBD::SQLite needed at least 4 modes of translating between Perl and C strings:

1) The current (“load-bearing-buggy”) default.
2) Same as #1, but use SvPVbyte to avoid the SvPV bug.
3) Current “naïve unicode” behaviour.
4) A “non-naïve unicode” mode that validates incoming UTF-8.

(I eventually made two variants of this last one: one that just warns on invalid data, and the other that throws an exception.)
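To make the list concrete, here is a rough pure-Perl sketch of how those modes might treat a string coming back from SQLite. The mode names and the `decode_from_sqlite` function are hypothetical, invented for illustration; they are not the names the pull request uses. Modes 1 and 2 differ only in the Perl-to-C direction, so they behave identically here.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical mode names that mirror the numbered list above.
sub decode_from_sqlite {
    my ($mode, $bytes) = @_;

    # Modes 1 and 2: hand the bytes back untouched.
    return $bytes if $mode eq 'pv' || $mode eq 'bytes';

    # Mode 3: set the UTF-8 flag without validating anything.
    if ($mode eq 'unicode_naive') {
        my $copy = $bytes;
        Encode::_utf8_on($copy);
        return $copy;
    }

    # Mode 4, throwing variant: die on invalid UTF-8.
    if ($mode eq 'unicode_strict') {
        return Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC);
    }

    # Mode 4, warning variant: warn, then fall back to the raw bytes.
    if ($mode eq 'unicode_warn') {
        my $text = eval {
            Encode::decode('UTF-8', $bytes,
                Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return $text if defined $text;
        warn "invalid UTF-8 from database; returning raw bytes\n";
        return $bytes;
    }

    die "unknown string mode: $mode\n";
}
```

Each mode is just one more branch, which is exactly what a boolean switch cannot accommodate.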

There was another problem, though: DBD::SQLite’s interface for controlling this was a boolean. That meant only two modes were even possible!

This exemplifies a principle a mentor of mine taught me years back: avoid boolean parameters. A boolean can only ever express two states, so it leaves you no room to add further configurations later.

(And for pity’s sake, abhor unnamed booleans in particular! What does the 0 in open_file($path, 0) mean??)

To fix this, my pull request had to deprecate the existing sqlite_unicode parameter. It’s an unfortunate step that’ll produce new warnings in existing applications, but the “omelet” here justifies the “broken egg”.
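The general shape of such a deprecation looks something like the sketch below. It is an illustration under assumptions, not the code from the pull request: `resolve_string_mode`, the `unicode` and `string_mode` keys, and the mode names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical names: "unicode" is the legacy boolean,
# "string_mode" is its multi-valued replacement.
sub resolve_string_mode {
    my (%attr) = @_;

    # The new parameter wins whenever the caller supplies it.
    return $attr{string_mode} if defined $attr{string_mode};

    # The legacy boolean still works, but it now warns, and it can
    # only ever select the two behaviours that existed before.
    if (exists $attr{unicode}) {
        warn "the 'unicode' flag is deprecated; use 'string_mode' instead\n";
        return $attr{unicode} ? 'unicode_naive' : 'pv';
    }

    # Neither given: keep the old default so existing code is unaffected.
    return 'pv';
}
```

A call like `resolve_string_mode(string_mode => 'unicode_strict')` also says what it means at the call site, unlike a bare `1` or `0`.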

. . . . . .