Unicode Leftover Bug From Hell

Or in other words, before getting to the gory details: DWScript now works when compiled with {$HIGHCHARUNICODE ON} on a machine with Cyrillic code page 1251.

DWScript was converted to Unicode years ago, and has been working just fine since.

But there was a leftover bug from that crossing of the Styx.

Failing in an unexpected place

Last week Alexey Kazantsev reported a bug, where the DWScript tokenizer was failing on very trivial code. This is a portion of the engine that is heavily trodden, trampled and pounded upon by the unit tests, so it was very surprising.

Even more surprising is that I couldn’t reproduce it. We checked Delphi versions, settings, even the source ZIP to exclude any SCM quirk. And the issue was still very “reliably” there in his case when HIGHCHARUNICODE was ON, and very reliably not there in my case, regardless of settings.

After some more digging, down to map files and comparison of the executable binaries, it came down to two different constant values and a simple line of code in the tokenizer, where sets are used to define character ranges. For instance,

cANYCHAR - [#13, #10]

is used to describe any character but CR and LF, and cANYCHAR is declared as

const cANYCHAR = [#0..#255];

In practice, since the Unicode conversion, the tokenizer only uses those sets for ASCII characters (#0 to #127), so the extra #128..#255 range of cANYCHAR was unused, and as long as the range ended at #127 or above, everything worked.

Except when the code is compiled with HIGHCHARUNICODE ON for a machine running with codepage 1251 (Cyrillic)…

Hidden AnsiChar

Even though there is no AnsiString or AnsiChar in sight, character sets are hard-coded by the compiler as sets of AnsiChar (one of the problematic choices made when Unicode strings were introduced in Delphi).

When compiling #255 with HIGHCHARUNICODE ON, the compiler thus understands it as Unicode character U+00FF (‘ÿ’, aka “Latin small letter y with diaeresis”), and then silently converts it to Ansi using the current system code page, which in that particular case yields a ‘?’, as ‘ÿ’ is not part of code page 1251.

So the constant declaration was then a silent equivalent of

const cANYCHAR = [#0..'?'];

Which obviously is not equivalent to “any char” anymore.
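One way to check whether a given build is affected is to count the members of the set at run time. The following is only a minimal diagnostic sketch, not code from DWScript: the program name and variable names are made up for illustration, and it assumes the constant is usable as a set of AnsiChar, as described above.

```delphi
program AnyCharCheck;

{$APPTYPE CONSOLE}
{$HIGHCHARUNICODE ON}  // the setting under which the bug was reported

const
  // Same declaration as in the article; the upper bound is the suspect.
  cANYCHAR = [#0..#255];

var
  c: AnsiChar;      // hypothetical loop variable
  count: Integer;   // hypothetical counter
begin
  count := 0;
  // Enumerate every possible AnsiChar value and count those in the set.
  for c := Low(AnsiChar) to High(AnsiChar) do
    if c in cANYCHAR then
      Inc(count);
  // 256 means the set is intact; anything lower suggests the upper
  // bound was converted to another character at compile time.
  Writeln('Members in cANYCHAR: ', count);
end.
```

Since the conversion happens at compile time, the result depends on the code page of the machine that built the executable, not the one running it.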

Bottom Line

Once that’s known, fixing it is simple: just change the declaration to

const cANYCHAR = [#0..#127];
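To illustrate why restricting the set to ASCII is enough when the input is UTF-16, here is a small sketch using the RTL's CharInSet. It is not the DWScript tokenizer: the program name, the cNotEOL constant and the IsLineContent helper are hypothetical, and the uses clause assumes a compiler that accepts the unscoped SysUtils unit name (newer versions may want System.SysUtils).

```delphi
program AsciiSetCheck;

{$APPTYPE CONSOLE}

uses
  SysUtils;  // provides CharInSet and TSysCharSet

const
  // The fixed declaration: only code points that are the same
  // in every Ansi code page.
  cANYCHAR = [#0..#127];
  // Hypothetical name for "any character but CR and LF".
  cNotEOL = cANYCHAR - [#13, #10];

// Hypothetical helper: True if ch is an ASCII character other than CR/LF.
function IsLineContent(ch: Char): Boolean;
begin
  Result := CharInSet(ch, cNotEOL);
end;

var
  ch: Char;
begin
  ch := 'a';
  Writeln('letter a: ', IsLineContent(ch));
  ch := #13;
  Writeln('CR: ', IsLineContent(ch));
end.
```

With this declaration, characters above #127 are never members of the set, so they have to be handled by separate logic, which matches the article's remark that the tokenizer only uses these sets for ASCII.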

The bottom line is that even after being used for years, there can still be bugs lurking in code that was converted to Unicode…

One thought on “Unicode Leftover Bug From Hell”

  1. Truncate thus: “The bottom line is that even after being used years, there can still be bugs lurking in code.”

    My very first venture in unit testing was applied to a unit which had been used for years, was well understood, and stable. You know I will now tell you that it was also buggy. 😉 And it didn’t even have the added fun of Unicode. As it happens, however, that unit was amenable to exhaustive testing. It involved numeric conversion between two different notations for television time, and therefore spanned only 2.5 million possible values.

    I’m sure someone must have formulated a rule which asserts that the buggiest units of code are those which are considered the most straightforward?
