UTF-8, UTF-16 or both? (poll)

The FreePascal version of DWScript has been stalled for a little while on the incomplete UnicodeString (utf-16) support, among other things.

It’s hard to blame the FreePascal team for that, given that Linux is primarily utf-8, and that utf-8 has quite a few advantages over utf-16.

utf-16 is an historical quirk

In summary, utf-16 was designed in an era when 65536 characters were thought to be enough for everyone, but things just didn’t quite turn out that way, as Verity Stob recounts, and utf-16 is just as variable-length as utf-8 in a modern world where a fair share of people use alphabets with many glyphs. On the other hand, utf-16 became a de-facto standard in many languages and platforms (Java, .Net, JavaScript, Delphi since 2009, etc.) despite its many quirks, from giving a false sense of fixed-width characters to being exposed to endianness issues. And utf-16 isn’t even saved by non-latin content: just take any Chinese text-heavy webpage and compare its utf-8 and utf-16 sizes. Punctuation and markup put utf-8 ahead.
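
To make that size comparison concrete, here is a minimal Delphi-style console sketch (the sample markup string is mine, not from the post):

    program Utf8VsUtf16Size;
    {$APPTYPE CONSOLE}
    var
       s : UnicodeString;
    begin
       // two CJK characters (U+4F60 U+597D) wrapped in ASCII markup
       s := '<p>' + #$4F60 + #$597D + '</p>';
       WriteLn(Length(s) * SizeOf(WideChar));   // utf-16 : 9 code units = 18 bytes
       WriteLn(Length(UTF8Encode(s)));          // utf-8  : 7*1 + 2*3    = 13 bytes
    end.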

DWScript situation

DWScript’s String type is currently utf-16, on Windows and in Smart, but I’m wondering whether to allow it to be utf-8 instead on some targets (FreePascal & Linux). DWScript doesn’t have a distinct utf-16 character type: characters in DWScript are either Unicode code-points (utf-32) or small strings (to accommodate Chinese characters). While this would “fork” the language, the effects would be restricted to code that

  • iterates over a string by character indexes, and checks for non-ASCII characters
  • slices non-ASCII characters/strings at fixed offsets
  • depends on lengths of strings that have non-ASCII characters

All those cases are probably quite low-level. The rest of the code would remain utf-8/utf-16 agnostic.
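
To make those cases concrete, here is a hypothetical DWScript-style sketch (the function names are mine): the first routine is encoding-sensitive and could report different results depending on whether String is utf-8 or utf-16 underneath, while the second remains encoding-agnostic:

    // Encoding-sensitive: assumes one string index = one character and
    // relies on per-index Ord values, both of which only hold for ASCII
    // once the underlying encoding can vary.
    function CountNonASCII(const s : String) : Integer;
    var
       i : Integer;
    begin
       Result := 0;
       for i := 1 to Length(s) do
          if Ord(s[i]) > 127 then
             Inc(Result);
    end;

    // Encoding-agnostic: concatenation and whole-string operations
    // behave identically on utf-8 and utf-16 targets.
    function Greet(const name : String) : String;
    begin
       Result := 'Hello, ' + name + '!';
    end;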

So what do you think?

DWScript utf-8 and/or utf-16 Strings?

  • Stick to utf-16 everywhere (lower performance in Linux) (29%, 36 Votes)
  • Fork the language (use what's best on each platform) (56%, 71 Votes)
  • I have no idea what this all means (15%, 19 Votes)

Total Voters: 126

Delphi mobile compilers

Note that I am aware that the new Delphi mobile compilers dropped UTF8String support (leading to ugly marshalling and performance issues), but they have priced themselves out of the market as far as I’m concerned, and the reliance on FMX is either problematic or an extra cost.

So non-HTML5 mobile support for DWScript is more likely to come through FreePascal than Delphi at the moment.

17 thoughts on “UTF-8, UTF-16 or both? (poll)”

  1. Another choice could be what we used for our mORMot library: its kernel is UTF-8 with a lot of optimized code (which makes sense for JSON processing e.g.), its database layer is natively UTF-8, and we switch to string only when we reach the VCL/RTL level.

    It is perfect for all compilers, and ensures that you separate your layers.

    You are free to reuse the SynCommons.pas optimized and tested code for DWS, if you wish to (as you did with the DB layer). It may save you from re-inventing the wheel.

  2. @A. Bouchez At the moment in FreePascal, DWScript strings are utf-16, but the rest of the FreePascal RTL is utf-8. The script engine can be very thin in some places: f.i. when the script is processing strings, the engine is basically just gluing together calls to RTL functions, and conversions are very penalizing performance-wise.

    Even if all common RTL functions were rewritten to avoid conversions, there would still be a penalty for all user functions.

  3. In SynCommons.pas, you have UTF-8 optimized versions of all the RTL string functions (like Pos, PosEx, IntToStr, StrToInt, FloatToString and so on).

    User functions should also use the same UTF-8 string type…

  4. UTF-8 on Linux would be used for interop only. All internal work is independent of system preferences. I think DWS should use the compiler-preferred unicode string type, and as far as I know, that’s UTF-16. I vote for UTF-16 🙂

  5. @Kazantsev So far the preferred string type for FreePascal is UTF8String. They have a “Unicode switch”, but it doesn’t work very well as most of the RTL/LCL and the rest of FreePascal is geared around utf-8…

  6. Delphi’s native UTF implementation, regardless of encoding size, is fabulously easy to break and somewhat flaky. In practice, UTF-8 is at least a *little* more straightforward.

  7. The UTF-8 scheme is more complex than UTF-16.
    UTF-8 is more complex for char indexing.
    UTF-8 is more complex for manipulating chars.
    UTF-8 is more comfortable for interacting with the external world.

    Therefore UTF-16 (or better, UTF-32, though it is wasteful of memory 🙂 ) must be used for the internal unicode representation, and UTF-8 for data exchange.

  8. My vote: stick to UTF-16, like all other environments that you’ve mentioned already.
    If Delphi, Java, DotNet and JavaScript ever move to something else, re-evaluate whether a change is needed.

    Or, if you want to stick to the tradition of Pascal, you should pick a type that no other Pascal has ever used before. I think UTF-32 is not used yet?

    Older Pascals: no strings available, just arrays of characters, like C.
    Turbo Pascal: max 255 8-bit characters; get/set the length via element 0.
    Old Delphis: ansi strings; variable length, 8-bit characters.
    Newer Delphis: UTF-16.
    Mobile Delphis: no ansistrings available anymore; marshal UTF-16 to native types.
    FreePascal: UTF-8.
    Oxygene on DotNet: System.String; immutable.
    Oxygene on iOS: auto-boxed NSString.
    ISO 10206 Extended Pascal: type String (Capacity: Integer) = record Length: 0 .. Capacity; String: packed array [1 .. Capacity + 1] of Char end;

    Looking at the poll, we can add to this list:
    DWS: “depends on what’s best for each platform” 🙂

  9. fwiw I would say that complex technical decisions are not best decided by a democratic process. 😉

  10. @Kazantsev utf-16 isn’t so simple: beyond the variable-length issues it shares with utf-8, going cross-platform also brings in the endianness issue. While that can be ignored in the Wintel world, it can’t be in the Linux world…

    @Wouter DWScript is already utf-32 oriented. It doesn’t have a ByteChar or a WideChar type: you manipulate characters with small strings and code-points with utf-32. The Chr, Ord & “for c in string” constructs operate on utf-32, and “case of” and “in” support strings. So the only issues are really for indexed character access and slicing.
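
    A small DWScript sketch of what that looks like in practice (the sample literals are mine, purely for illustration):

        var c : String;
        PrintLn(Ord('€'));        // 8364 : Ord yields the Unicode code point
        PrintLn(Chr(8364));       // '€'  : Chr builds a character from a code point
        for c in 'ab€' do begin   // iterates code points, not utf-16 code units
           case c of              // "case of" matches against strings
              '€' : PrintLn('euro sign');
           else
              PrintLn(c);        // 'a', 'b'
           end;
        end;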

    @Jolyon It’s not as much a democratic process as it is a “taking of the pulse” 😉

  11. I strongly favour UTF-8. Reading data from external sources (database, web, …) is mostly ASCII.
    And bulk reading of data is where I find my bottlenecks, so transcoding at the “boundary” of the application is something to avoid.

    Bonus points for offering an internal UTF-16 type that can be used for optimization when planning to do something extreme with the Windows API.

    Transcoding a string on the fly whenever I have to print it to the screen goes unnoticed: there’s a lot of other stuff going on (determining the rectangles, colors, aligning all those characters nicely…), so transcoding there doesn’t make much of a difference.

    My advice: look at what Embarcadero is doing. Do exactly the opposite and you’re on the right track.

  12. In UTF-16 we have one index for any code point in the BMP, which covers almost all languages in current use. UTF-8 has that for ASCII only. To validate UTF-16 we only need to check surrogate pairs, while UTF-8 requires checking 9 ranges of proper byte sequences. Endianness only matters for exchange, not for internal use.
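
    For illustration, a minimal Delphi-style sketch (mine, not from the comment) of that utf-16 well-formedness check, where the only rule is that surrogates pair up correctly:

        function IsWellFormedUTF16(const s : UnicodeString) : Boolean;
        var
           i : Integer;
        begin
           i := 1;
           while i <= Length(s) do begin
              case Ord(s[i]) of
                 $D800..$DBFF : begin
                    // a high surrogate must be immediately followed by a low one
                    if (i = Length(s)) or (Ord(s[i+1]) < $DC00) or (Ord(s[i+1]) > $DFFF) then
                       Exit(False);
                    Inc(i, 2);
                 end;
                 $DC00..$DFFF :
                    Exit(False);   // a lone low surrogate is malformed
              else
                 Inc(i);           // any other code unit is fine on its own
              end;
           end;
           Result := True;
        end;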

  13. @Kazantsev ucs-2 isn’t utf-16. Besides Chinese, you have latin with diacritics, which I frequently bump into, as my first name starts with an ‘É’ that I often have to type as just ‘E’ to avoid collation issues in systems, made worse by keyboards only having ‘é’ directly accessible; and if I enter ‘É’ as a decomposed character, most “Unicode” software breaks in one way or another. And finally, you’ve got some characters outside the BMP that are becoming very frequent (thanks to twitter).
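
    To make the decomposed-character pitfall concrete, here is a minimal Delphi-style sketch (the literals are mine): both strings render as ‘É’, yet naive code sees two different strings:

        var
           composed, decomposed : UnicodeString;
        begin
           composed := #$00C9;               // U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
           decomposed := 'E' + #$0301;       // U+0045 + U+0301 COMBINING ACUTE ACCENT
           WriteLn(composed = decomposed);   // FALSE : ordinal comparison, identical rendering
           WriteLn(Length(composed));        // 1
           WriteLn(Length(decomposed));      // 2 : Length counts code units, not characters
        end.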

  14. @Kazantsev: If you say that UTF-16 is less complex than UTF-8 you seem to ignore anything outside the BMP. You’re moving out of the standard. You’re voting for UCS-2, not UTF-16.

    If you care about the standards you’ll notice that UTF-16 isn’t easier than UTF-8. It’s pretty similar to UTF-8, plus:
    + endianness problems
    + incompatibility with ASCII
    + forced conversion when talking to the world (Database, Web, …)

    ?!??

  15. Use UTF-8 everywhere. Allow conversion to and from UTF-16, UTF-32, ANSI just at the interop boundary.

  16. I am talking about UTF-16, not UCS-2. I mentioned the supplementary range of UTF-16 (surrogate pairs), so I am not ignoring anything outside the BMP. But it is still simpler than working with chars in UTF-8. For the internal unicode representation I see no sense in using a multi-byte encoding scheme. UTF-16 in this case is a reasonable compromise between comfortable work with code points and memory use.
