Crouching Smileys, Hidden Diacritics

By Eric Grange / October 11, 2013

Revenge of the Hidden Diacritics

Diacritics are not just useful for mathematicians and Zalgo fun: they can also have some practical use for us mere non-English Latin-alphabet users.

Take a random accented character, like “é” in “éric”: it can be either pre-composed (U+00E9) or decomposed as “e” (U+0065) plus “combining acute accent” (U+0301).
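To make the two forms concrete, here is a minimal sketch in Python (not the Delphi used later in this post). It only assumes the standard `unicodedata` module; the variable names are made up for the illustration.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render the same, but they are different code point sequences.
same_raw = precomposed == decomposed

# NFD decomposes and NFC recomposes, so normalizing makes them comparable.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print(same_raw, nfd == decomposed, nfc == precomposed)  # False True True
```

Normalizing both sides to the same form before comparing is the usual way to treat the two spellings as equal.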

Size-wise, a pre-composed ‘é’ is 2 bytes wide in both utf-8 and utf-16, while a decomposed one is 3 bytes in utf-8 and 4 bytes in utf-16.
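Those sizes are easy to check. The sketch below is Python rather than Delphi, and `encoded_sizes` is a hypothetical helper name; it encodes without a byte order mark so that only the character itself is counted.

```python
def encoded_sizes(text):
    """Return (utf-8 byte count, utf-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(encoded_sizes("\u00e9"))   # pre-composed é -> (2, 2)
print(encoded_sizes("e\u0301"))  # decomposed é   -> (3, 4)
```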

Now, why would anyone use decomposed characters? Well, as the Wikipedia article on pre-composed characters puts it, “precomposed characters are the legacy solution for representing many special letters in various character sets”. Even if you don’t look at Chinese, the hordes of pre-composed Latin characters in the Unicode map certainly contribute to character count inflation.

But pre-composition doesn’t just affect character count, it also affects font complexity and collation, as there is no way around using collation tables to sort out and compare pre-composed characters.

Collation is why comparing and sorting Unicode strings is much slower than doing the same for ASCII strings, and why optimistic approaches can work well in the field.
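One possible reading of “optimistic” here (an assumption, since the post does not spell it out) is to bet that the strings are plain ASCII, and only pay for the Unicode machinery when that bet fails. A minimal Python sketch of that idea for equality tests, with a hypothetical function name, and with normalization standing in for a full collation table:

```python
import unicodedata

def equal_ignoring_form(a, b):
    """Compare two strings, treating pre-composed and decomposed forms as equal."""
    # Optimistic fast path: pure-ASCII strings never need normalization.
    if a.isascii() and b.isascii():
        return a == b
    # Slow path: bring both strings to the same normalization form first.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(equal_ignoring_form("eric", "eric"))             # True, fast path
print(equal_ignoring_form("\u00e9ric", "e\u0301ric"))  # True, slow path
```

This only covers equality; real sorting would still need locale-aware collation on the slow path.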

Fun with Decomposed Diacritics

Comparing and sorting decomposed Latin characters is much simpler, and effectively backward compatible. Upper- and lower-casing is similarly much simpler.

Whip up your favorite Unicode Delphi and create a new application. Drop a TEdit and two TLabels. In the Form’s OnCreate have something like:

Edit1.Text:='ée'+#$0301;

and in the Edit’s OnChange drop code like:

Label2.Caption := UpperCase(Edit1.Text);  // UpperCase, the classic one
Label1.Caption := AnsiUpperCase(Edit1.Text); // AnsiUpperCase, the misnamed

Run the application and you’ll see ‘éé’ in the edit box, ‘éÉ’ in the UpperCase label and ‘ÉÉ’ in the AnsiUpperCase one. The decomposed character survived the classic code!
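If you don’t have Delphi at hand, the same experiment can be approximated in Python. This is only a sketch: `ascii_upper` is a hypothetical stand-in that uppercases a-z and nothing else, in the spirit of the classic UpperCase, and Python’s own `str.upper` plays the role of the Unicode-aware routine.

```python
import unicodedata

def ascii_upper(text):
    """Uppercase only a-z and leave every other code point untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)

text = "\u00e9" + "e\u0301"   # pre-composed é followed by decomposed é

classic = ascii_upper(text)   # the pre-composed é is not touched
aware = text.upper()          # both é are uppercased

# Recompose for display, so both results can be compared as plain characters.
print(unicodedata.normalize("NFC", classic))  # éÉ
print(unicodedata.normalize("NFC", aware))    # ÉÉ
```

The ASCII-only routine uppercases the base letter ‘e’ and leaves the combining accent in place, which is why the decomposed form comes out right.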

Similarly, some contact lists have alphabetical tabs or shortcuts, but in many cases those apps don’t handle accented upper-case characters well, and ‘É’ ends up in the ‘Others’ tab rather than the ‘E’ one. If you use a decomposed character, it will work.
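A guess at what such an app does internally (purely hypothetical, the function below is not taken from any real contact list): look at the first code unit and file the name under it only if it is a plain A-Z letter. In Python:

```python
def tab_for(name):
    """Naive bucketing: first code point if it is an ASCII letter, else 'Others'."""
    first = name[:1]
    if first.isascii() and first.isalpha():
        return first.upper()
    return "Others"

print(tab_for("\u00c9ric"))   # pre-composed É -> Others
print(tab_for("E\u0301ric"))  # decomposed É   -> E
```

With the decomposed form, the first code point is a plain ‘E’, so even this naive logic files the name where you would expect it.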

If, like me, you have a diacritic on the first letter of your name, think of the decomposed diacritics. You can’t type them easily, but you can copy-paste them easily. I can’t type É easily either, so it’s a toss-up.

Oh, and don’t forget to bug the developers later on, when the decomposed character trips up some of their misplaced assumptions that ucs-2 is utf-16 somewhere else in their code 😉
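The assumption that usually breaks is “one 16-bit code unit equals one character”. A small Python sketch, with a hypothetical helper name, counts the UTF-16 code units behind a decomposed ‘É’ and behind the winking smiley:

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

decomposed = "E\u0301"   # 'E' + COMBINING ACUTE ACCENT, both inside the BMP
wink = "\U0001F609"      # the winking smiley, outside the BMP

print(utf16_code_units(decomposed))  # 2: one visible character, two code units
print(utf16_code_units(wink))        # 2: one code point, stored as a surrogate pair
```

Code that indexes, truncates or reverses strings one code unit at a time will split both of these.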


18 thoughts on “Crouching Smileys, Hidden Diacritics”

adrian says:

October 11, 2013 at 16:09

I am getting the feeling that this UTF16 “hate” is going out of control, so I’ll try to provide a little balance:

I’ve always found the criticism of UTF16 overblown, and found that the “UTF8 manifesto” is too one-sided.

In short, it doesn’t really matter if we use utf8, 16 or even 32 as long as everybody uses the same.
UTF8 being late, it actually added more confusion to the existing standards, but well, we can’t hope for UTF8 going away more than we can hope for UTF16 going away. We have to deal with both.

About the advantages of UTF8, many of them are disadvantages too, depending on how you look at them.

1) Let’s start with the obvious: UTF8 is backwards compatible with ASCII. This is a big advantage, and the main problem with UTF8. Because you can call ASCII methods from your UTF8 app without realizing it, and if you speak English you won’t notice a problem. If you used UTF16, it would be obvious that the “Length(…)” method you use is not UTF16 enabled. If you use UTF8 and call an ASCII Length(…) method, it will work… as long as you test it with ASCII characters.

UTF16 breaks all existing C string management routines (because of the 0 bytes, which terminate the string). This ensures you won’t call any existing string method that has not been revised for Unicode.

2) Size: Yes, uncompressed UTF8 has an edge. Is it worth switching to UTF8 just for this? I doubt it. And for compressed storage/transmission, the advantage isn’t that much.

3) The fact that you are going to find errors faster in UTF8, because as soon as you go outside English, it will break and you will realize. In theory this is an advantage, yes, but not such a big one. Because realistically UTF16 is good enough even if you don’t handle the multibyte part. It is true, you won’t be able to see “Pile of poo” ( http://www.fileformat.info/info/unicode/char/1f4a9/index.htm ) (and side note: how isn’t this one first at emojitracker??). Some Chinese guys will be sad because they can’t write their name with a character. And you might have issues with composed characters. But the rest will work. And that is quite a good advance from today, where even with UTF everywhere, half of the time my name comes as “Adri?n” instead of “Adrián” in so many places that I don’t even use the á anymore.

I just got an email with this subject a moment ago: “Support: Javier HernÃ¡ndez”. Note that I got it in gmail, which should fully support unicode. But somewhere in the middle it got corrupted. And you can even see the reason: It was UTF8. You can see a good explanation here:
http://stackoverflow.com/questions/5127744/how-to-convert-these-strange-characters
but in short, some UTF8 was handled by some non-UTF8-enabled method, and the bytes that compose ‘á’ were separated. I see things like this everywhere, and I think that if we can’t yet correctly handle accented characters, we shouldn’t worry that much about the pile of poo.
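That kind of corruption can be reproduced in a couple of lines. The Python sketch below assumes the offending step decoded the UTF-8 bytes as Latin-1; the actual mail pipeline is unknown, so the codec choice is a guess that happens to give the same visible result.

```python
original = "Javier Hern\u00e1ndez"

# Encode correctly as UTF-8, then decode with the wrong single-byte codec.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # Javier HernÃ¡ndez

# As long as no byte was lost, the damage can be undone by reversing the steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)   # True
```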

But well, all in all it doesn’t matter. If you want to support either UTF8 or UTF16 right, you need to handle multibyte characters, and the effort to do so is the same. If you don’t handle multibyte characters, UTF8 will fail for everything but plain English, and UTF16 will fail for some very special cases. Whether this is a UTF8 advantage or disadvantage depends on how you look at the problem.

But also, correctly handling multibyte characters is the easiest thing in Unicode, so I am not sure how changing to UTF8 would help. You need to take care of ligatures, right to left, up to down, composed characters, etc, etc, and the chances that your app correctly supports all of that stuff if you don’t design for it and test it are very slim, no matter if you use UTF8 or UTF32. If you only worry about multibyte your app might be able to show “unamused face”, but not Arabic text.

Kazantsev Alexey says:

October 11, 2013 at 17:18

When I speak of the simplicity of UTF-16, I have in mind a comparison with UTF-8. Yes, UTF-16 requires supporting the supplementary range of Unicode, but it is simpler than UTF-8. When we work with a UTF-16 string, we must check whether the current code unit belongs to the surrogates (a single range) and, if it does, decode the surrogate pair into a supplementary Unicode code point (I am not talking about checking the correctness of the surrogate pair for now). With UTF-8 we need to check which of four ranges the current byte belongs to, just to get the length of the next byte sequence, and that is without any correctness checks. When we need to check the correctness of a string (which is needed for any input data), we will be doing more dirty work (9 ranges with proper byte sequences) than checking a surrogate pair for UTF-16.
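The two checks being compared can be sketched as follows, in Python, with hypothetical function names. Neither function validates anything beyond the first unit, which matches the “without any correctness checks” case described above.

```python
def utf8_sequence_length(lead_byte):
    """Length of the UTF-8 sequence announced by lead_byte (four ranges)."""
    if lead_byte < 0x80:
        return 1
    if 0xC0 <= lead_byte < 0xE0:
        return 2
    if 0xE0 <= lead_byte < 0xF0:
        return 3
    if 0xF0 <= lead_byte < 0xF8:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def is_surrogate(code_unit):
    """True if a UTF-16 code unit is half of a surrogate pair (one range)."""
    return 0xD800 <= code_unit <= 0xDFFF

print(utf8_sequence_length("\u20ac".encode("utf-8")[0]))  # 3
print(is_surrogate(0xD83D))                               # True
```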
Eric says:

October 11, 2013 at 17:25

This isn’t as much utf-8 vs utf-16, as it is utf-16 vs ucs-2.

Most of the code out there that claims to be utf-16 is actually just ucs-2 capable, what you’re arguing by “utf-16” being good enough is actually “ucs-2” is good enough.
Interestingly enough, the point you complain about your mail is some code that treated utf-8 as if it was ASCII, yet you argue about treating utf-16 as if it was ucs-2, which will result in the same kind of errors.

My point is that the claim that “utf-16 is easier” is just plain wrong. ucs-2 was easier, but ucs-2 is not Unicode, and increasingly less so in the modern world. And Unicode is messy and complex, regardless of the utf. The diacritic issues can still happen for utf-32 strings.

If all those accents are still messed up these days, it’s because of the ASCII and UCS-2 programmers. UCS-2 is as much legacy as ASCII is.
utf-16 is backward compatible with ucs-2, and in some way that’s part of the problem, as it allows old ucs-2 code to process Unicode without crashing outright. Same story as between utf-8 and ASCII.

The smiley thing is probably just funny right now, but it’s an illustration that the standard is going forward, and you can’t cling to ucs-2 anymore than you can cling to ASCII for anything more than backward compatibility.

Eric says:

October 11, 2013 at 17:36

@Kazantsev With UTF-8 you only need to check the range when converting between encodings. For usual processing, you just have to think in terms of comparing & searching (sub)strings.

An isolated Unicode code-point is insufficient (because of diacritics), so you’ve got to let go of the notion that a character can be expressed by a single integer value (be it 8, 16 or 32 bits), because it isn’t so in Unicode.
In layman’s terms, an “end-user character” is a Unicode string.

That’s IMHO the key difference between Unicode and ASCII, UCS-2 or older MBCS. In Unicode, characters are not a single ordinal but a list of ordinals.

adrian says:

October 11, 2013 at 18:33

Eric,
>This isn’t as much utf-8 vs utf-16, as it is utf-16 vs ucs-2.

Well, if the main point was utf16 vs ucs2 then I agree, people should be aware of utf16. I was commenting on the “we should switch from utf16 to utf8” claims that cropped up everywhere. UTF8 is as hard as UTF16.

>Most of the code out there that claims to be utf-16 is actually just ucs-2 capable, what you’re arguing by “utf-16” being good enough is actually “ucs-2” is good enough.

I didn’t argue that “utf16” is good enough, I argued that “realistically UTF16 is good enough even if you don’t handle the multibyte part” where by saying “you don’t handle the multibyte part”, I was in fact referring to treating utf16 as ucs2. I just didn’t use the term ucs2 because I didn’t want to introduce new terminology and also even MS isn’t clear about what ucs2 exactly is.

>Interestingly enough, the point you complain about your mail is some code that treated utf-8 as if it was ASCII, yet you argue about treating utf-16 as if it was ucs-2, which will result in the same kind of errors.

What I wanted to say is that while indeed they are the same kind of errors, they aren’t of the same order of magnitude as UTF8 proponents say. Yes, if you use UCS2 instead of UTF16 you commit the “same” mistake as if you use ASCII instead of UTF8. But the consequences aren’t the same. Failing to handle multibyte in UTF16 would still produce Adrián and Éric. While failing to handle multibyte in UTF8 will break anything that is not ASCII.

While I completely agree that we should treat UTF16 as UTF16 and not UCS2, I also think that the consequences of the mistake are much worse if you treat UTF8 as ASCII. As said, I would like to see a world where at least à is treated ok, even if ☕ isn’t. Of course, I would prefer a world where everything is treated right, but this just won’t happen, not with UTF8 nor with UTF16. So accepting that there will be people who make mistakes, I prefer the mistake of using UCS2 instead of ASCII.

>The smiley thing is probably just funny right now, but it’s an illustration that the standard is going forward, and you can’t cling to ucs-2 anymore than you can cling to ASCII for anything more than backward compatibility.

While I agree that indeed in the future the problem will get worse (as soon as we even get fonts that support those characters, most today don’t), I think you will be able to cling to ucs-2 for a much longer time than you can cling to ASCII. Doesn’t mean you should.

But well, my comment was against the idea of “UTF16 is complex, so let’s switch to UTF8 (which is as complex)”. Or “UTF16 is an historical artifact, let’s switch to UTF8 that is also an historical artifact”. Yes, if we were creating the world today we wouldn’t have either utf8 or utf16, as both were designed with backwards compatibility in mind (with ASCII and UCS2 respectively) and are constrained by that compatibility.

Kazantsev Alexey says:

October 11, 2013 at 18:57

@Eric: Any char-by-char algorithm requires detection of the byte sequence length. Any data from external world always requires full checking of correctness.

Andreas says:

October 12, 2013 at 15:35

I haven’t seen a convincing technical argument in favour of UTF-16 yet.

Adrian’s argument seems to be that a broken UTF-16 implementation is better than a broken UTF-8 implementation. meh.. Is that really a technological argument? That’s more a political argument involving an uneducated userbase.
Also using compression doesn’t come for free.

Alexey maybe has a special algorithm in mind that’s easier to do in UTF-16 and then postulates a dogma: “Any data from external world always requires full checking of correctness.” (well – I just don’t buy it)

Some arguments for UTF-8:
– ASCII is as common as dirt. There are a lot of standards out there that use it. Processing ASCII by using UTF-16 is confusing.
– on the algorithmic side UTF-16 has this Factor 2 of inefficiency for ASCII built in – that’s going to cost. You may find an edge case where UTF-16 does ok, but that inefficiency is not going away as long as ASCII matters.
Last week I just used a dictionary-class where a node needs a pointer for each character (256 Pointers). Using 16 bit Characters would balloon the Memory usage by a factor of 256. And if I process UTF-16 Characters by splitting them it’d double the lookup times for ASCII.

Finding a technical argument in favour of UTF-16 seems to be hard.

So there’s the question: What kind of algorithms actually favour UTF-16?
So far the situations I’ve had to deal with all strongly favoured UTF-8. And technical arguments I’ve seen in favour of UTF-16 usually boiled down to: “UCS-2 is shiny”.

Kazantsev Alexey says:

October 12, 2013 at 20:40

@Andreas: It is not dogma, it’s a real necessity of our world. If we do not check input data then we open the door to problems. The UTF-8 RFC talks about it (http://tools.ietf.org/html/rfc3629#section-10). If we use UTF-8 as the internal string format then every one of us must do such checks on input data.

ASCII, ASCII, ASCII… Wake up Neo, the world is not ASCII-only any more. Any code point outside of the ASCII range is represented in UTF-8 by two to four bytes. The bigger part of the BMP (after code point #2047) requires three bytes per code point in UTF-8 and two in UTF-16. What memory savings can we talk about? Savings on ASCII? That’s not serious.

My point: UTF-16 is simpler than UTF-8 for manipulating code points. It is simpler and faster because it requires fewer operations.

One of my use cases.
I have a class for serializing data in some format. When I have a long string which cannot be placed in the buffer completely, I must divide it at a character border. For UTF-16 it is a very simple job: the main objective is not to divide a surrogate pair, if one exists at all. For UTF-8 it is really a more complex job.

Andreas says:

October 12, 2013 at 23:41

@Alexey: Wake me up when the Latin alphabet, numbers and whitespace go out of fashion.
Just look around. This here is an article about non BMP-Characters – how many Characters do you see here on this webpage that are not ASCII?

For your serialization the Algorithm is probably something like Memcopy (Bufsize-5) Bytes – loop over the next 5 bytes until you hit a 0xxxxxxx or a 11xxxxxx.
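A variation on that idea can be sketched in Python. Instead of copying Bufsize-5 bytes and scanning forward, the hypothetical `split_utf8` below cuts at the buffer size and backs up while the byte at the cut is a continuation byte (10xxxxxx). It assumes the input is already valid UTF-8 and that the buffer holds at least 4 bytes.

```python
def split_utf8(data, max_len):
    """Split UTF-8 bytes into chunks of at most max_len bytes on code point borders."""
    if max_len < 4:
        raise ValueError("max_len must be at least 4")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_len, len(data))
        # Back up while the cut would land on a continuation byte.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            raise ValueError("data is not valid UTF-8")
        chunks.append(data[start:end])
        start = end
    return chunks

data = "a\u00e9\u20ac\U0001F609".encode("utf-8")   # 1 + 2 + 3 + 4 bytes
print([chunk.decode("utf-8") for chunk in split_utf8(data, 4)])
```

Note that this only protects code points: a base letter and its combining diacritic can still end up in different chunks, in UTF-8 and UTF-16 alike.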

It’s slightly easier to code with UTF-16, but for serializing e.g. this webpage with UTF-16 you’ll need about twice as many Buffers. For the majority of use cases (and for pretty much every text I’ll ever have to deal with) UTF-8 will be roughly twice as fast (the additional loop at the tail will cost a few percent). So generally choosing UTF-16 to serialize data (e.g. Numbers) would be just completely insane.

Kazantsev Alexey says:

October 13, 2013 at 10:13

@Andreas: I can show you sites where the Cyrillic alphabet or hieroglyphs are used. And on this site I can use my native language, but it won’t improve our understanding 🙂 With sites and data in XML we have a very interesting situation. Using a language-specific encoding scheme (windows-1251 for Cyrillic, for example) is more efficient than using the universal UTF-8 encoding and doesn’t lead to loss of data. HTML and XML allow escaping for chars which cannot be represented in the document’s encoding scheme. Therefore the main content can be represented in a one-byte encoding scheme while unrepresentable data (a citation from a Chinese site on a Cyrillic news site, for example) can be escaped. And that is more efficient than UTF-8, if we are talking about efficiency 🙂

But we are talking about the string format used for internal purposes. My serializer works with strings in UTF-16, but its output may be in any other encoding scheme, from ASCII and UTF-8 to UTF-32 and local language-specific encoding schemes. I don’t consider UTF-16 convenient for data interchange, and I already said so.

Kryvich says:

October 13, 2013 at 11:21

@adrian
> Yes, if you use UCS2 instead of UTF16 you commit the “same” mistake as if you use ASCII instead of UTF8. But the consequences aren’t the same. Failing to handle multibyte in UTF16 would still produce Adrián and Éric. While failing to handle multibyte in UTF8 will break anything that is not ASCII.

It means that UTF-16 errors are more hidden and harder to find than UTF-8 ones.

@Alexey
> When I have a long string which cannot be placed in the buffer completely, I must divide it at a character border. For UTF-16 it is a very simple job: the main objective is not to divide a surrogate pair, if one exists at all. For UTF-8 it is really a more complex job.

To avoid errors you have to call library functions for such work, no matter what character encoding you use.

Kazantsev Alexey says:

October 13, 2013 at 12:43

@Kryvich: It won’t reduce the number of operations executed.

Andreas says:

October 14, 2013 at 12:58

@Alexey
I see that UTF-8 is not perfect for Cyrillic, but the more efficient encoding of whitespace, punctuation and numbers should still make it a close tie with UTF-16 there.

But I think UTF-16 isn’t a good encoding for cyrillic either. If I was living in Russia and had your willingness to inspect and validate everything and needed an efficient internal representation of unicode, I’d probably go with a custom 8-bit encoding optimized for cyrillic.

That would again have the potential to beat UTF-8 and UTF-16 by the factor 2 in a lot of usecases.
(e.g. using 00000000 – 10111111 for ASCII + basic cyrillic and 11xxxxxx as a start for everything else – fixed width of n bytes or whatever…)

Hasn’t someone in Russia already invented an 8-bit encoding standard for Unicode that’s optimized for Cyrillic?

Kazantsev Alexey says:

October 14, 2013 at 17:48

@Andreas: Cyrillic is an example and nothing more. UTF-8 is not perfect for any non-ASCII alphabet. And what is more, for some alphabets UTF-8 is worse than UTF-16. In a global world we can’t consider ASCII the central point.

When you talk about whitespace, punctuation and numbers, are you still talking about the internal Unicode representation? Because that is what we are discussing.

Kryvich says:

October 15, 2013 at 11:32

@Alexey
> It won’t reduce the number of operations executed.

I doubt that the difference will be essential in real world applications.

@Andreas
Actually, Cyrillic letters occupy 2 bytes in UTF-8, same as in UTF-16. Therefore a UTF-8 encoded Cyrillic text is more compact than a UTF-16 one. And yes, we can use CP-1251 internally to process Cyrillic texts. So for Russian coders there is no gain in using the UTF-16 encoding.
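The arithmetic can be checked on a short sample. The Python sketch below uses an arbitrary phrase (9 Cyrillic letters plus a comma, a space and an exclamation mark); the sample text is chosen only for illustration.

```python
text = "Привет, мир!"   # 9 Cyrillic letters and 3 ASCII characters

sizes = {
    "utf-8": len(text.encode("utf-8")),       # 9 * 2 + 3 * 1
    "utf-16": len(text.encode("utf-16-le")),  # 12 * 2, no BOM
    "cp1251": len(text.encode("cp1251")),     # 12 * 1
}
print(sizes)   # {'utf-8': 21, 'utf-16': 24, 'cp1251': 12}
```

The UTF-8 advantage over UTF-16 here comes entirely from the ASCII punctuation and whitespace, so it grows or shrinks with their share of the text.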
Kryvich says:

October 15, 2013 at 11:52

Eric, it would be very desirable to have both UTF-8 and ANSI string types in DWScript, with String type = UTF8String. I don’t think that it’ll be hard to add both, as long as both UTF-8 and ANSI strings consist of 1-byte code units internally.

Eric says:

October 15, 2013 at 12:53

@Kryvich The issue is that Delphi RTL has good UTF-16 support, but poor (or missing) UTF-8 support, and UTF-16 is “native” in Windows & JavaScript, but FreePascal has good UTF-8 support but poor (or missing) UTF-16 support, and UTF-8 is “native” in Linux.

This isn’t so much about supporting both, it’s more about whether String is utf-8, utf-16, or if it depends on the platform. If what String is does not depend on the platform, some platforms will be penalized, and the penalty is quite significant in FreePascal currently, as strings tend to enter and leave the script engine a lot IME.

Kryvich says:

October 15, 2013 at 18:10

Eric,
you can declare the default String type as you want, providing that all main string encodings will be supported in DWS. And coders can explicitly use AnsiString, UTF8String, UnicodeString (UTF-16) and UTF32String types. Though I would set String = UTF8String for FPC/Linux to avoid penalties.
