As noted in a recent post, Unicode is not so straightforward. That goes in particular for claims that utf-16 is simpler than utf-8, or that you do not have to care about Unicode complexities.
Maybe that was the case ten years ago, but the Unicode jungle is much closer to home these days.
Here are a few dangers lurking in the not-so-dark shadows.
Characters that don’t fit in a WideChar are rare
When speaking of characters outside the BMP (Basic Multilingual Plane, i.e. what fits in a single WideChar), everyone thinks of Egyptian hieroglyphs or rare/historical Chinese characters, which westerners can safely ignore and easterners will smile about.
Alas, those rare Chinese characters happen in names, and while you could just take the haughty road and ignore that the name of worker #85901 in factory #5627 was getting mangled, when the issue hit well-known politicians… it ended with laws requiring proper support of all the Unicode characters.
Nowadays you also have the Mathematical symbols outside the BMP, which are cropping up more and more, thanks to growing software support.
But you may not sell your software to China or mathematicians.
So, you are safe, aren’t you?
Strike of the Crouching Smileys
But the most common character outside the BMP is a smiley, U+1F602, otherwise known as “FACE WITH TEARS OF JOY”. At the time of writing, it was ranked #2 in emojitracker.
And it has friends like U+1F612 “UNAMUSED FACE” at rank #5.
Not convinced? Check U+1F648, U+1F46F or U+1F357. I’m sure you’ll see the light after seeing those.
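To make "outside the BMP" concrete, here is a small sketch that lists the utf-16 code units (what Delphi would store as WideChars) for a few code points. It is written in Python rather than Delphi only because it is quick to run and check; the helper name utf16_units is made up for this illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# U+00E9 sits inside the BMP; the smileys do not.
for cp in (0x00E9, 0x1F602, 0x1F612, 0x1F648):
    units = utf16_units(chr(cp))
    shown = " ".join(f"{u:04X}" for u in units)
    print(f"U+{cp:04X}: {len(units)} code unit(s): {shown}")
```

Every smiley above takes two code units (a surrogate pair), so any code that assumes one WideChar per character is already wrong for them.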
If your code crops strings blindly to certain lengths or doesn’t generally treat utf-16 as being the variable-length encoding it is, don’t be surprised if you hear that your software is an “epic fail”. Even infrequent savaging of smileys will have that effect.
Slicing emoji could easily happen if you truncate strings for display, and not only is that not Kawaii, it could also change the meaning of the message, as the erroneously sliced character could show up as a question mark “?”…
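One way to avoid savaging a smiley is to check the code unit just before the cut and back off if it is a high (lead) surrogate. The following is a minimal sketch of that idea, in Python for ease of testing; truncate_utf16 is a hypothetical helper, and it works on utf-16 code units so that the arithmetic matches what a Delphi string would see.

```python
def truncate_utf16(text, max_units):
    """Truncate text to at most max_units UTF-16 code units
    without splitting a surrogate pair."""
    data = text.encode("utf-16-le")
    if len(data) <= max_units * 2:
        return text  # already short enough
    cut = max_units
    if cut > 0:
        # Look at the last code unit that would be kept.
        last = int.from_bytes(data[(cut - 1) * 2:cut * 2], "little")
        # A high surrogate (D800..DBFF) needs the unit after it:
        # drop it rather than keep half a character.
        if 0xD800 <= last <= 0xDBFF:
            cut -= 1
    return data[:cut * 2].decode("utf-16-le")


message = "ok " + chr(0x1F602)        # 3 BMP characters + 1 smiley = 5 code units
print(repr(truncate_utf16(message, 4)))  # the smiley is dropped whole
```

This only guards against splitting surrogate pairs. It does not keep a base letter together with its combining marks, which is a separate problem.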
But you may not sell your software to teens, or may generally frown on the twittering crowds.
So, you are safe, aren’t you?
Revenge of the Hidden Diacritics
Diacritics are not just useful for mathematicians and Zalgo fun, they can have some practical use for us, mere non-English Latin-alphabet users.
Take a random accented character, like “é” in “éric”: it can be either pre-composed (U+00E9) or decomposed as “e” (U+0065) plus “combining acute accent” (U+0301).
Size-wise, a pre-composed ‘é’ is 2 bytes wide in both utf-8 and utf-16, while a decomposed one is 3 bytes in utf-8 and 4 bytes in utf-16.
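Those sizes are easy to check. Here is a short sketch using Python's standard unicodedata module (Python is just a convenient calculator here), with NFD standing in for "decomposed" and NFC for "pre-composed":

```python
import unicodedata

precomposed = "\u00e9"                                  # 'é' as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' + U+0301

for label, s in (("pre-composed", precomposed), ("decomposed", decomposed)):
    utf8_len = len(s.encode("utf-8"))
    utf16_len = len(s.encode("utf-16-le"))  # no byte-order mark
    print(f"{label}: {utf8_len} bytes in utf-8, {utf16_len} bytes in utf-16")

# The two forms look identical on screen but do not compare equal as raw strings.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The last two lines show the practical catch: a plain comparison treats the two spellings as different strings until both are normalized to the same form.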
Now, why would anyone use decomposed characters? Well, as the Wikipedia article on pre-composed characters puts it, “precomposed characters are the legacy solution for representing many special letters in various character sets”. Even if you don’t look at Chinese, the hordes of pre-composed Latin characters in the Unicode map certainly contribute to character count inflation.
But pre-composition doesn’t just affect character count, it also affects font complexity and collation, as there is no way around using collation tables to sort out and compare pre-composed characters.
Collation is why comparing and sorting Unicode strings is much slower than doing the same for ASCII strings, and why optimistic approaches can work well in the field.
Fun with Decomposed Diacritics
Comparing and sorting decomposed Latin characters is much simpler, and effectively backward compatible. Upper- and lower-casing is similarly much simpler.
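Whip up your favorite Unicode Delphi and create a new application. Drop a TEdit and two TLabel on the form. In the Form’s OnCreate have something like:
Edit1.Text := #$00E9 + 'e' + #$0301; // pre-composed 'é', then 'e' + combining acute accent
and in the Edit’s OnChange drop code like: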
Label2.Caption := UpperCase(Edit1.Text); // UpperCase, the classic one
Label1.Caption := AnsiUpperCase(Edit1.Text); // AnsiUpperCase, the misnamed one
Run the application, and you’ll see ‘éé’ in the edit box, ‘éÉ’ in the UpperCase label and ‘ÉÉ’ in the AnsiUpperCase one. The decomposed character survived the classic code!
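Why does that happen? Here is a sketch of the mechanism in Python, assuming (as the result above suggests) that the classic routine only touches the ASCII letters a to z. The function ascii_upper is a stand-in written for this illustration, not Delphi's actual implementation, and the test string is a pre-composed é followed by a decomposed one.

```python
import unicodedata


def ascii_upper(s):
    """Upper-case only 'a'..'z', leaving every other code unit untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


text = "\u00e9" + "e\u0301"   # pre-composed é, then decomposed é

classic = ascii_upper(text)   # ASCII-only upper-casing
full = text.upper()           # Unicode-aware upper-casing

# Normalizing to NFC makes the outcome easy to compare.
print(unicodedata.normalize("NFC", classic))  # é stays lower-case, decomposed é becomes É
print(unicodedata.normalize("NFC", full))     # both become É
```

The ASCII-only routine never sees an "é" in the decomposed case: it sees a plain "e", upper-cases it, and the combining accent that follows simply comes along for the ride.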
Similarly, sometimes you have contact lists with alphabetical tabs or shortcuts, but in many cases those apps don’t handle accented upper-case characters well, and ‘É’ ends up in the ‘Others’ tab rather than the ‘E’ one. If you use a decomposed character, it will work.
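A naive tab picker might look like the following Python sketch. The function tab_for and its "Others" fallback are hypothetical, modeled on the behavior described above rather than on any particular app.

```python
def tab_for(name):
    """Pick an alphabetical tab from the first code unit of a name,
    falling back to 'Others' for anything outside A..Z."""
    first = name[:1].upper()
    return first if "A" <= first <= "Z" else "Others"


print(tab_for("\u00c9ric"))    # pre-composed É: lands in 'Others'
print(tab_for("E\u0301ric"))   # decomposed É: first code unit is a plain 'E'
```

With the decomposed spelling the first code unit is an ordinary "E", so even code that knows nothing about accents files the name under the right tab.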
If, like me, you have a diacritic on the first letter of your name, think of the decomposed diacritics. You can’t type them easily, but you can copy-paste them easily. I can’t type É easily either, so it’s a toss-up.
Oh, and don’t forget to bug the developers later on, when the decomposed character trips some of their misplaced assumptions that ucs-2 is utf-16 somewhere else in their code 😉