Revenge of the Hidden Diacritics
Diacritics are not just useful for mathematicians and Zalgo fun; they can also be of practical use to us mere non-English Latin-alphabet users.
Take a random accented character, like “é” in “éric”: it can be either pre-composed (U+00E9) or decomposed as “e” (U+0065) followed by a combining acute accent (U+0301).
Size-wise, a pre-composed ‘é’ is 2 bytes wide in UTF-8 and UTF-16; a decomposed one is 3 bytes in UTF-8 and 4 bytes in UTF-16.
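Those sizes are easy to check for yourself. Here is a minimal console sketch, assuming a Unicode Delphi where TEncoding is available; the program and variable names are made up for the illustration:

```pascal
program DiacriticSizes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain "SysUtils" before Delphi XE2

var
  Precomposed, Decomposed: string;
begin
  // Four-digit #$ literals are parsed as UTF-16 code units
  Precomposed := #$00E9;    // "é" as a single code point
  Decomposed := 'e'#$0301;  // "e" followed by the combining acute accent

  Writeln('UTF-8:  ', TEncoding.UTF8.GetByteCount(Precomposed), ' vs ',
    TEncoding.UTF8.GetByteCount(Decomposed));    // UTF-8:  2 vs 3
  Writeln('UTF-16: ', TEncoding.Unicode.GetByteCount(Precomposed), ' vs ',
    TEncoding.Unicode.GetByteCount(Decomposed)); // UTF-16: 2 vs 4
end.
```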
Now, why would anyone use decomposed characters? Well, as the Wikipedia article on precomposed characters puts it, “precomposed characters are the legacy solution for representing many special letters in various character sets”. Even if you don’t look at Chinese, the hordes of pre-composed Latin characters in the Unicode map certainly contribute to character count inflation.
But pre-composition doesn’t just affect character count; it also affects font complexity and collation, as there is no way around using collation tables to sort and compare pre-composed characters.
Collation is why comparing and sorting Unicode strings is much slower than doing the same for ASCII strings, and why optimistic approaches (such as trying a plain ASCII comparison first and falling back to collation only when needed) can work well in the field.
Fun with Decomposed Diacritics
Comparing and sorting decomposed Latin characters is much simpler, and effectively backward compatible: the base letter is a plain ASCII character, so code that only knows about ASCII still sees an ‘e’. Upper- and lower-casing are similarly much simpler.
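To see what that buys you when sorting, here is a small sketch using CompareStr, which does a plain ordinal comparison with no collation table involved. The sample names are made up:

```pascal
program OrdinalCompare;

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain "SysUtils" before Delphi XE2

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$00E9'ric';    // starts with U+00E9
  Decomposed := 'e'#$0301'ric';  // starts with "e" + U+0301

  // The decomposed name lands among the other E names...
  Writeln(CompareStr(Decomposed, 'eric') > 0);   // TRUE
  Writeln(CompareStr(Decomposed, 'frank') < 0);  // TRUE
  // ...while the pre-composed one ends up after every ASCII letter
  Writeln(CompareStr(Precomposed, 'zoe') > 0);   // TRUE
end.
```

The order is not linguistically perfect (the decomposed name still sorts after every unaccented “e…” name), but it stays in the right neighbourhood without any Unicode awareness.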
Whip up your favorite Unicode Delphi and create a new application. Drop a TEdit and two TLabels on the form. In the Form’s OnCreate, have something like:
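A minimal sketch of that handler, assuming the default component names (Form1, Edit1). The exact string is an assumption, reconstructed from the result described further down: a pre-composed ‘é’ followed by a decomposed one. It is a form method, so it only compiles inside the form's unit:

```pascal
procedure TForm1.FormCreate(Sender: TObject);
begin
  // Pre-composed "é" (U+00E9), then decomposed "é" ("e" + combining U+0301).
  // Four-digit #$ literals are parsed as UTF-16 code units.
  Edit1.Text := #$00E9 + 'e' + #$0301;
end;
```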
and in the Edit’s OnChange, drop code like:
Label2.Caption := UpperCase(Edit1.Text); // UpperCase, the classic one
Label1.Caption := AnsiUpperCase(Edit1.Text); // AnsiUpperCase, the misnamed one
Run the application: you’ll see ‘éé’ in the edit box, ‘éÉ’ in the UpperCase label and ‘ÉÉ’ in the AnsiUpperCase one. The decomposed character survived the classic code!
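If you would rather not trust your eyes (the two forms of ‘É’ look identical), the same experiment can be run in a console and dumped as code units. A sketch, where CodeUnits is a hypothetical helper written for this example and the AnsiUpperCase result assumes Windows:

```pascal
program UpperCaseCheck;

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain "SysUtils" before Delphi XE2

// Hypothetical helper: lists the UTF-16 code units of S in hex,
// which sidesteps any console encoding trouble.
function CodeUnits(const S: string): string;
var
  C: Char;
begin
  Result := '';
  for C in S do
    Result := Result + IntToHex(Ord(C), 4) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$00E9'e'#$0301; // pre-composed "é", then decomposed "é"
  Writeln(CodeUnits(UpperCase(S)));     // 00E9 0045 0301
  Writeln(CodeUnits(AnsiUpperCase(S))); // 00C9 0045 0301
end.
```

UpperCase leaves U+00E9 alone but turns the plain ‘e’ into ‘E’ (0045), and the combining accent simply tags along.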
Similarly, some contact lists have alphabetical tabs or shortcuts, but in many cases those apps don’t handle accented upper-case characters well, and ‘É’ ends up in the ‘Others’ tab rather than the ‘E’ one. If you use a decomposed character, it will work.
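Here is a sketch of how such a tab picker might look. TabFor is a hypothetical function, deliberately ASCII-only to mimic the apps in question; it is not taken from any real contact list:

```pascal
program ContactTabs;

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // plain "SysUtils" before Delphi XE2

// Hypothetical tab picker that only knows about 'A'..'Z'.
function TabFor(const Name: string): string;
var
  First: Char;
begin
  Result := 'Others';
  if Name = '' then
    Exit;
  First := UpCase(Name[1]); // UpCase only touches 'a'..'z'
  if CharInSet(First, ['A'..'Z']) then
    Result := First;
end;

begin
  Writeln(TabFor(#$00C9'ric'));    // Others (pre-composed "É")
  Writeln(TabFor('E'#$0301'ric')); // E (decomposed "É")
end.
```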
If, like me, you have a diacritic on the first letter of your name, think of the decomposed diacritics. You can’t type them easily, but you can copy-paste them easily. I can’t type É easily either, so it’s a toss-up.
Oh, and don’t forget to bug the developers later on, when the decomposed character trips their misplaced assumption, somewhere else in their code, that one UTF-16 code unit is always one character 😉