Crouching Smileys, Hidden Diacritics

As noted in a recent post, Unicode is not so straightforward, contrary to claims that utf-16 is simpler than utf-8, or that you do not have to care about Unicode complexities.

Maybe that was the case ten years ago, but the Unicode jungle is much closer to home these days.

Here are a few dangers lurking in the not-so-dark shadows.

Characters that don’t fit in a WideChar are rare

When speaking of characters outside the BMP (Basic Multilingual Plane, i.e. what fits in a single WideChar), everyone thinks of Egyptian hieroglyphs or rare/historical Chinese characters, which westerners can safely ignore and easterners will smile about.

Alas, those rare Chinese characters happen in names, and while you could take the haughty road and ignore that the name of worker #85901 in factory #5627 was getting mangled, once the issue hit well-known politicians… it ended with laws requiring proper support of all Unicode characters.

Nowadays you also have the mathematical symbols outside the BMP, which are cropping up more and more thanks to growing software support.

But maybe you do not sell your software to China or to mathematicians.
So, you are safe, aren’t you?

Strike of the Crouching Smileys

But the most common character outside the BMP is a smiley, U+1F602, otherwise known as “FACE WITH TEARS OF JOY”. At the time of writing, it was ranked #2 in emojitracker.

And it has friends like U+1F612 “UNAMUSED FACE” at rank #5.

Not convinced? Check U+1F648, U+1F46F or U+1F357. I’m sure you’ll see the light once you’ve looked them up.

If your code blindly crops strings to a certain length, or doesn’t generally treat utf-16 as the variable-length encoding it is, don’t be surprised if you hear that your software is an “epic fail”. Even infrequent savaging of smileys will have that effect.

Slicing an emoji can easily happen when you truncate text for display, and not only is that not Kawaii, it can also change the meaning of the message, as the sliced character may show up as a question mark “?”…
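
For illustration, dodging that particular failure can be as simple as refusing to cut between two surrogates. Below is a minimal sketch, assuming a UTF-16 String type as in recent Delphi; the function name is purely illustrative, and it deliberately ignores combining marks, which are the subject of the next post:

    // Truncate a UTF-16 string for display without splitting a surrogate pair.
    // Illustrative only: no grapheme or combining-mark awareness.
    function SafeTruncate(const s : String; maxChars : Integer) : String;
    begin
       if Length(s) <= maxChars then Exit(s);
       // if the last WideChar kept is a high surrogate ($D800..$DBFF), the cut
       // would land in the middle of a non-BMP character: back off by one
       if (maxChars > 0) and (Ord(s[maxChars]) >= $D800) and (Ord(s[maxChars]) <= $DBFF) then
          Dec(maxChars);
       Result := Copy(s, 1, maxChars);
    end;

With U+1F602 in the last position, a blind Copy would keep only the high surrogate, and your users get the dreaded “?” instead of their tears of joy.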

But maybe you do not sell your software to teens, or you generally frown on the twittering crowds.
So, you are safe, aren’t you?

Next: Hidden Diacritics

18 thoughts on “Crouching Smileys, Hidden Diacritics”

  1. I am getting the feeling that this UTF16 “hate” is going out of control, so I’ll try to provide a little balance:

    I’ve always found the criticism of UTF16 overblown, and the “UTF8 manifesto” too one-sided.

    In short, it doesn’t really matter whether we use utf8, 16 or even 32, as long as everybody uses the same one.
    UTF8 arrived late and actually added more confusion to the existing standards, but well, we can’t hope for UTF8 to go away any more than we can hope for UTF16 to go away. We have to deal with both.

    About the advantages of UTF8, many of them are disadvantages too, depending on how you look at them.

    1) Let’s start with the obvious: UTF8 is backwards compatible with ASCII. This is a big advantage, and also the main problem with UTF8, because you can call ASCII methods from your UTF8 app without realizing it, and if you speak English you won’t notice a problem. If you used UTF16, it would be obvious that the “Length(…)” method you use is not UTF16-enabled. If you use UTF8 and call an ASCII Length(…) method, it will work… as long as you test it with ASCII characters (see the sketch at the end of this comment).

    UTF16 breaks all existing C string management routines (because of the zero bytes, which terminate a C string). This ensures you won’t call any existing string method that hasn’t been revised for Unicode.

    2) Size: yes, uncompressed UTF8 has an edge. Is it worth switching to UTF8 just for this? I doubt it. And for compressed storage/transmission, the advantage isn’t that big.

    3) The fact that you are going to find errors faster in UTF8, because as soon as you go outside English it will break and you will notice. In theory this is an advantage, yes, but not such a big one, because realistically UTF16 is good enough even if you don’t handle the multibyte part. It is true, you won’t be able to see “Pile of poo” ( http://www.fileformat.info/info/unicode/char/1f4a9/index.htm ) (and side note: how is this one not first at emojitracker??). Some Chinese guys will be sad because they can’t write their name with a character. And you might have issues with composed characters. But the rest will work. And that is quite an improvement over today, where even with UTF everywhere, half of the time my name comes out as “Adri?n” instead of “Adrián”, in so many places that I don’t even use the á anymore.

    I just got an email with this subject a moment ago: “Support: Javier Hernández”. Note that I got it in Gmail, which should fully support Unicode. But somewhere in the middle it got corrupted, and you can even see the reason: it was UTF8. You can see a good explanation here:
    http://stackoverflow.com/questions/5127744/how-to-convert-these-strange-characters
    but in short, some UTF8 was handled by a non-UTF8-enabled method, and the bytes that compose ‘á’ were separated. I see things like this everywhere, and I think that if we can’t yet correctly handle accented characters, we shouldn’t worry that much about the pile of poo.

    But well, all in all it doesn’t matter. If you want to support either UTF8 or UTF16 properly, you need to handle multibyte characters, and the effort to do so is the same. If you don’t handle multibyte characters, UTF8 will fail for everything but plain English, and UTF16 will fail only in some very special cases. Whether this is a UTF8 advantage or disadvantage depends on how you look at the problem.

    But also, correctly handling multibyte characters is the easiest thing in Unicode, so I am not sure how changing to UTF8 would help. You need to take care of ligatures, right-to-left, top-to-bottom, composed characters, etc., etc., and the chances that your app correctly supports all of that stuff if you don’t design for it and test it are very slim, no matter whether you use UTF8 or UTF32. If you only worry about multibyte, your app might be able to show “unamused face”, but not Arabic text.
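
    To make point 1) and the corrupted accents concrete, here is a little Delphi-style sketch (illustrative only, assuming a Unicode compiler where String is UTF-16):

        procedure LengthDemo;
        var
           u16 : String;       // UTF-16 in a Unicode Delphi
           u8  : UTF8String;
        begin
           u16 := 'Adrián';
           u8  := UTF8String(u16);
           WriteLn(Length(u16));   // 6 – UTF-16 code units, one per letter here
           WriteLn(Length(u8));    // 7 – bytes, because 'á' is the two-byte sequence $C3 $A1
           // Hand those bytes to anything that assumes one byte = one character
           // (an ASCII Length, Copy, Pos…) and the counts are off; display them
           // through a Windows-1252 viewer and 'á' shows up as 'Ã¡', which is
           // exactly the kind of corruption in that mail subject.
        end;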

  2. When I said UTF-16 is simple, I had in mind a comparison with UTF-8. Yes, UTF-16 requires supporting the supplementary range of Unicode, but it is still simpler than UTF-8. When we work with a UTF-16 string, we must check whether the current code unit belongs to the surrogate range (a single range), and if it does, decode the surrogate pair into a supplementary code point (I’m not even talking about checking the correctness of the surrogate pair here). With UTF-8 we need to check which of four ranges the current byte belongs to just to get the length of the next byte sequence, and that is without any correctness checking. When we need to check the correctness of a string (which is needed for any input data), we end up doing much more dirty work (9 ranges with their proper byte sequences) than checking a surrogate pair in UTF-16.
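
    Roughly, the two checks I am comparing, as a Delphi-style sketch (illustrative only, with no validation at all; names are just for the example):

        // UTF-16: a single range test, then one arithmetic step to the code point
        function CodePointAt(const s : String; i : Integer) : Cardinal;
        begin
           if (Ord(s[i]) >= $D800) and (Ord(s[i]) <= $DBFF) then   // high surrogate?
              Result := $10000 + ((Ord(s[i]) - $D800) shl 10) + (Ord(s[i + 1]) - $DC00)
           else Result := Ord(s[i]);
        end;

        // UTF-8: classify the lead byte just to learn how many bytes follow
        function Utf8SeqLength(lead : Byte) : Integer;
        begin
           if lead < $80 then Result := 1
           else if (lead and $E0) = $C0 then Result := 2
           else if (lead and $F0) = $E0 then Result := 3
           else if (lead and $F8) = $F0 then Result := 4
           else Result := 0;   // continuation or invalid lead byte
        end;

    And that is before any validation: checking a whole UTF-8 string against the legal lead/continuation byte ranges is the dirty work I mentioned.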

  3. This isn’t as much utf-8 vs utf-16, as it is utf-16 vs ucs-2.

    Most of the code out there that claims to be utf-16 is actually just ucs-2 capable; when you argue that “utf-16” is good enough, you are actually arguing that “ucs-2” is good enough.
    Interestingly enough, the problem you complain about with your mail is some code that treated utf-8 as if it were ASCII, yet you argue for treating utf-16 as if it were ucs-2, which results in the same kind of errors.

    My point is that the claim that “utf-16 is easier” is just plain wrong. ucs-2 was easier, but ucs-2 is not Unicode, and it covers less and less of it in the modern world. And Unicode is messy and complex, regardless of the utf. The diacritics issues can still happen with utf-32 strings.

    If all those accents are still messed up these days, it’s because of the ASCII and UCS-2 programmers. UCS-2 is as much legacy as ASCII is.
    utf-16 is backward compatible with ucs-2, and in some way that’s part of the problem, as it allows old ucs-2 code to process Unicode without crashing outright. Same story as between utf-8 and ASCII.

    The smiley thing is probably just funny right now, but it’s an illustration that the standard is going forward, and you can’t cling to ucs-2 anymore than you can cling to ASCII for anything more than backward compatibility.

  4. @Kazantsev With UTF-8 you only need to check the range when converting between encodings. For usual processing, you just have to think in terms of comparing & searching (sub)strings.

    An isolated Unicode code-point is insufficient (because of diacritics), so you’ve got to let go of the notion that a character can be expressed by a single integer value (be it 8, 16 or 32 bits), because it isn’t so in Unicode.
    In layman’s terms, an “end-user character” is a Unicode string.

    That’s IMHO the key difference between Unicode and ASCII, UCS-2 or older MBCS. In Unicode, characters are not a single ordinal but a list of ordinals.
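
    A small Delphi-style illustration (the names and literals are just for the example):

        procedure DiacriticDemo;
        const
           Composed   = 'caf'#$00E9;        // 'café' with é as the single code point U+00E9
           Decomposed = 'cafe'#$0301;       // 'cafe' followed by combining acute U+0301
        begin
           // Both render as 'café', yet the lengths differ and a plain ordinal
           // comparison says they are not equal – equality needs normalization.
           WriteLn(Length(Composed));       // 4
           WriteLn(Length(Decomposed));     // 5
           WriteLn(Composed = Decomposed);  // FALSE
        end;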

  5. Eric,
    >This isn’t as much utf-8 vs utf-16, as it is utf-16 vs ucs-2.

    Well, if the main point was utf16 vs ucs2 then I agree, people should be aware of utf16. I was commenting on the “we should switch from utf16 to utf8” claims that cropped up everywhere. UTF8 is as hard as UTF16.

    >Most of the code out there that claims to be utf-16 is actually just ucs-2 capable; when you argue that “utf-16” is good enough, you are actually arguing that “ucs-2” is good enough.

    I didn’t argue that “utf16” is good enough, I argued that “realistically UTF16 is good enough even if you don’t handle the multibyte part”, where by “you don’t handle the multibyte part” I was in fact referring to treating utf16 as ucs2. I just didn’t use the term ucs2 because I didn’t want to introduce new terminology, and also because even MS isn’t clear about what exactly ucs2 is.

    >Interestingly enough, the problem you complain about with your mail is some code that treated utf-8 as if it were ASCII, yet you argue for treating utf-16 as if it were ucs-2, which results in the same kind of errors.

    What I wanted to say is that while they are indeed the same kind of errors, they aren’t of the same order of magnitude as UTF8 proponents claim. Yes, if you use UCS2 instead of UTF16 you commit the “same” mistake as if you use ASCII instead of UTF8. But the consequences aren’t the same. Failing to handle multibyte in UTF16 would still produce Adrián and Éric, while failing to handle multibyte in UTF8 will break anything that is not ASCII.

    While I completely agree that we should treat UTF16 as UTF16 and not as UCS2, I also think that the consequences of the mistake are much worse if you treat UTF8 as ASCII. As said, I would like to see a world where at least à is treated ok, even if ☕ isn’t. Of course, I would prefer a world where everything is treated right, but that just won’t happen, not with UTF8 nor with UTF16. So, accepting that there will be people who make these mistakes, I prefer the mistake of treating UTF16 as UCS2 over the mistake of treating UTF8 as ASCII.

    >The smiley thing is probably just funny right now, but it’s an illustration that the standard is going forward, and you can’t cling to ucs-2 anymore than you can cling to ASCII for anything more than backward compatibility.

    While I agree that the problem will indeed get worse in the future (as soon as we get fonts that support those characters; most today don’t), I think you will be able to cling to ucs-2 for a much longer time than you can cling to ASCII. That doesn’t mean you should.

    But well, my comment was against the idea of “UTF16 is complex, so let’s switch to UTF8” (which is just as complex), or “UTF16 is a historical artifact, let’s switch to UTF8” (which is also a historical artifact). Yes, if we were creating the world today we wouldn’t have either utf8 or utf16, as both were designed with backwards compatibility in mind (with ASCII and UCS2 respectively) and are constrained by that compatibility.

  6. I haven’t seen a convincing technical argument in favour of UTF-16 yet.

    Adrian’s argument seems to be that a broken UTF-16 implementation is better than a broken UTF-8 implementation. Meh… is that really a technical argument? That’s more a political argument involving an uneducated user base.
    Also using compression doesn’t come for free.

    Alexey maybe has a special algorithm in mind that’s easier to do in UTF-16 and then postulates a dogma: “Any data from the external world always requires full checking of correctness.” (well – I just don’t buy it)

    Some arguments for UTF-8:
    – ASCII is as common as dirt. There are a lot of standards out there that use it. Processing ASCII by using UTF-16 is confusing.
    – on the algorithmic side, UTF-16 has this factor-2 inefficiency for ASCII built in – that’s going to cost. You may find an edge case where UTF-16 does ok, but that inefficiency is not going away as long as ASCII matters.
    Last week I used a dictionary class where a node needs a pointer for each character (256 pointers). Using 16-bit characters would balloon the memory usage by a factor of 256 (see the sketch at the end of this comment). And if I process UTF-16 characters by splitting them, it’d double the lookup times for ASCII.

    Finding a technical argument in favour of UTF-16 seems to be hard.

    So there’s the question: What kind of algorithms actually favour UTF-16?
    So far the situations I’ve had to deal with all strongly favoured UTF-8. And technical arguments I’ve seen in favour of UTF-16 usually boiled down to: “UCS-2 is shiny”.
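
    To put a number on the dictionary example, the factor of 256 is simply the size of the node type (a sketch, sizes assuming 64-bit pointers):

        type
           // byte-indexed trie node: 256 child pointers, roughly 2 KB per node
           PByteNode = ^TByteNode;
           TByteNode = array [Byte] of PByteNode;

           // naively WideChar-indexed node: 65536 child pointers, roughly 512 KB
           // per node, i.e. 256 times larger
           PWideNode = ^TWideNode;
           TWideNode = array [Word] of PWideNode;

    Splitting each UTF-16 character into two byte-sized lookups keeps the small nodes, but as said it doubles the lookups for ASCII.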

  7. @Andreas: It’s not dogma, it’s a real necessity of our world. If we don’t check input data, we open the door to problems. The UTF-8 RFC says so (http://tools.ietf.org/html/rfc3629#section-10). If we use UTF-8 as the internal string format, then every one of us must do such checks on input data.

    ASCII, ASCII, ASCII… Wake up, Neo, the world is not ASCII-only any more. Any code point outside the ASCII range is represented in UTF-8 with two to four bytes. The bigger part of the BMP (above code point #2047) requires three bytes per code point in UTF-8 and two in UTF-16. So what memory savings are we talking about? Savings on ASCII? That’s not serious.

    My point: UTF-16 is simpler than UTF-8 for manipulating code points. It is simpler and faster because it requires fewer operations.

    One of my use cases:
    I have a class that serializes data in some format. When a long string cannot be placed in the buffer completely, I must split it on a character border. For UTF-16 that is a very simple job: the main objective is not to split a surrogate pair, if there even is one. For UTF-8 it is a considerably more complex job.

  8. @Alexey: Wake me up when the Latin alphabet, numbers and whitespace go out of fashion.
    Just look around. This is an article about non-BMP characters – how many characters do you see on this web page that are not ASCII?

    For your serialization, the algorithm is probably something like: memcopy (BufSize - 5) bytes, then loop over the next 5 bytes until you hit a 0xxxxxxx or a 11xxxxxx byte (sketched at the end of this comment).

    It’s slightly easier to code with UTF-16, but for serializing e.g. this web page with UTF-16 you’ll need about twice as many buffers. For the majority of use cases (and for pretty much every text I’ll ever have to deal with) UTF-8 will be roughly twice as fast (the additional loop at the tail costs a few percent). So generally choosing UTF-16 to serialize data (e.g. numbers) would be just completely insane.
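
    The boundary scan itself is only a handful of lines: a sketch, assuming the buffer already holds valid UTF-8 (TBytes here is simply a dynamic array of Byte, and the name is just for illustration):

        // Largest cut point <= maxBytes that does not split a UTF-8 sequence:
        // back off while the byte at the cut position is a continuation byte (10xxxxxx).
        function Utf8CutPoint(const buf : TBytes; maxBytes : Integer) : Integer;
        begin
           if maxBytes >= Length(buf) then Exit(Length(buf));
           Result := maxBytes;
           while (Result > 0) and ((buf[Result] and $C0) = $80) do
              Dec(Result);   // at most 3 steps back, sequences are 1 to 4 bytes long
        end;

    The UTF-16 equivalent is the single test Alexey describes: if the last WideChar kept is a high surrogate ($D800..$DBFF), cut one WideChar earlier.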

  9. @Andreas: I can show you sites where the Cyrillic alphabet or hieroglyphs are used. And on this site I could use my native language, but it wouldn’t improve our understanding 🙂 With web sites and XML data we have a very interesting situation: using a language-specific encoding scheme (windows-1251 for Cyrillic, for example) is more efficient than using the universal UTF-8 encoding, and it doesn’t lead to data loss. HTML and XML allow escaping characters which cannot be represented in the document’s encoding scheme, so the main content can be represented in a one-byte encoding while unrepresentable data (a quotation from a Chinese site on a Cyrillic news site, for example) can be escaped. And that is more efficient than UTF-8, if we are talking about efficiency 🙂

    But we are talking about the string format used for internal purposes. My serializer works with strings in UTF-16, but its output may be any other encoding scheme, from ASCII and UTF-8 to UTF-32 and local language-specific encodings. I don’t consider UTF-16 convenient for data interchange, and I have already said so.

  10. @adrian
    > Yes, if you use UCS2 instead of UTF16 you commit the “same” mistake as if you use ASCII instead of UTF8. But the consequences aren’t the same. Failing to handle multibyte in UTF16 would still produce Adrián and Éric, while failing to handle multibyte in UTF8 will break anything that is not ASCII.

    That means UTF-16 errors are more hidden and harder to find than UTF-8 ones.

    @Alexey
    > When a long string cannot be placed in the buffer completely, I must split it on a character border. For UTF-16 that is a very simple job: the main objective is not to split a surrogate pair, if there even is one. For UTF-8 it is a considerably more complex job.

    To avoid errors you have to call library functions for such work, no matter what character encoding you use.

  11. @Alexey
    I see that UTF-8 is not perfect for Cyrillic, but the more efficient encoding of whitespace, punctuation and numbers should still make it a close tie with UTF-16 there.

    But I think UTF-16 isn’t a good encoding for Cyrillic either. If I were living in Russia and had your willingness to inspect and validate everything, and needed an efficient internal representation of Unicode, I’d probably go with a custom 8-bit encoding optimized for Cyrillic.

    That would again have the potential to beat UTF-8 and UTF-16 by a factor of 2 in a lot of use cases.
    (e.g. using 00000000 – 10111111 for ASCII + basic Cyrillic and 11xxxxxx as a start byte for everything else – fixed width of n bytes or whatever…)

    Hasn’t someone in Russia already invented an 8-bit encoding standard for Unicode that’s optimized for Cyrillic?

  12. @Andreas: Cyrillic is just an example, nothing more. UTF-8 is not perfect for any non-ASCII alphabet, and what is more, for some alphabets UTF-8 is worse than UTF-16. In a global world we can’t consider ASCII the central point.

    When you talk about whitespace, punctuation and numbers, are you still talking about the internal Unicode representation? Because that is what we are discussing.

  13. @Alexey
    > It won’t reduce executed number of operations.

    I doubt the difference will be significant in real-world applications.

    @Andreas
    Actually, Cyrillic letters occupy 2 bytes in UTF-8, the same as in UTF-16. Therefore a UTF-8 encoded Cyrillic text is more compact than a UTF-16 one (whitespace, digits and punctuation stay single-byte). And yes, we can use CP-1251 internally to process Cyrillic text. So for Russian coders there is no gain in using the UTF-16 encoding.

  14. Eric, it would be very desirable to have both UTF-8 and ANSI string types in DWScript, with String type = UTF8String. I don’t think that it’ll be hard to add both, as long as both UTF-8 and ANSI strings consist of 1-byte code units internally.

  15. @Kryvich The issue is that the Delphi RTL has good UTF-16 support but poor (or missing) UTF-8 support, and UTF-16 is “native” in Windows & JavaScript, while FreePascal has good UTF-8 support but poor (or missing) UTF-16 support, and UTF-8 is “native” in Linux.

    This isn’t so much about supporting both, it’s more about whether String is utf-8, utf-16, or whether it depends on the platform. If what String is does not depend on the platform, some platforms will be penalized, and the penalty is currently quite significant in FreePascal, as strings tend to enter and leave the script engine a lot IME.

  16. Eric,
    you can declare the default String type as you want, provided that all the main string encodings are supported in DWS, and coders can explicitly use the AnsiString, UTF8String, UnicodeString (UTF-16) and UTF32String types. Though I would set String = UTF8String for FPC/Linux to avoid penalties.
