One of the “novelties” of the NextGen Delphi compiler is immutable strings, which I find quite puzzling, for lack of a better word, given that Delphi already had reference-counted copy-on-write strings, and the NextGen compiler uses reference-counted strings.
I always considered that Delphi’s String type was one of its remaining strong points, being a high-level abstraction (higher than Java’s or .Net’s String/StringBuilder dichotomy) with excellent low-level performance (on par with C/C++ character arrays).
From the recent discussions, it appears many don’t know what makes/made Delphi String so special, so here is a quick summary.
String being immutable means you can keep a single reference across threads without trouble. That’s an advantage over C strings.
It also means that copying a string, be it for an assignment or for parameter passing, is just like passing a reference: you don’t have to duplicate the content to be sure it isn’t modified behind your back. That’s an advantage over C strings and StringBuilder.
Note that none of the above are advantages over Delphi Strings, since the copy-on-write mechanism means that Delphi Strings are effectively immutable once they’re referenced more than once.
Reference-counting vs Garbage Collection
Every time a string is assigned or passed as a parameter, its reference count has to be increased. This is an atomic (interlocked) increment rather than a plain one, and it is part of memory management, so it’s there whether you’re using simple reference-counting or copy-on-write.
Under a GC, no atomic operation is required: only a simple reference (pointer) has to be copied. This is very efficient locally, but the memory management costs are just deferred to a later garbage collection phase. Since immutable strings don’t hold references to other objects, the GC for them can theoretically happen in parallel without any drawbacks (assuming the GC supports it).
So under a GC, an immutable String type makes a whole lot of sense, as implementing a copy-on-write one requires a lot of effort, and a mutable one is problematic multi-threading wise.
Making reference-counted strings mutable doesn’t change any of the above. You just add one capability: when the reference count indicates there is no other reference to a string, you can mutate it, i.e. change characters, adjust its length, etc.
In other words, when the only reference to a string is a single variable locally scoped to a procedure, it’s safe to do just about anything with it. The multi-threading issues can’t apply until that string is referenced somewhere else.
This is both convenient and very efficient, since what the compiler does before applying a mutation can be summarized as:
    if myString is "referenced somewhere else" then
        myString := make a local copy of myString
    mutate myString
The local copy is of course referenced nowhere else, and thus is safe to mutate. Copy-on-write is really copy-on-mutate, as it encompasses not just changing the characters, but also resizing a string (re-allocations) and concatenations.
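The copy-on-mutate behavior can be observed directly by comparing the buffer addresses of two string variables. This is only a minimal sketch, assuming a desktop Delphi compiler with 1-based strings (or Free Pascal in Delphi mode); the program and variable names are purely illustrative.

```pascal
program CowDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
var
  A, B: string;
begin
  A := StringOfChar('x', 4);         // built at run time, so heap-allocated and reference-counted
  B := A;                            // no copy: only the reference count goes up
  Writeln(Pointer(A) = Pointer(B));  // TRUE, both variables share one buffer

  B[1] := 'y';                       // mutation: the buffer is shared, so B is copied first
  Writeln(Pointer(A) = Pointer(B));  // FALSE, B now owns a private buffer
  Writeln(A);                        // xxxx
  Writeln(B);                        // yxxx

  A := B;                            // share again
  B := B + '!';                      // concatenation is a mutation too, A is left untouched
  Writeln(A);                        // yxxx
  Writeln(B);                        // yxxx!
end.
```

The string is built with StringOfChar rather than taken from a literal because string constants are not reference-counted the same way, which would blur the demonstration.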
Keep in mind this is an “added-on” behavior, where you just take advantage of the memory management scheme being a reference-counting one. If you know what you’re doing and want more performance, you can even waive the COW check by using UniqueString(), which will ensure you have a local copy, and then acquiring a PChar to the string content.
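As a rough illustration of that last point, the sketch below pays for the uniqueness check once, then writes through a PChar. It assumes a desktop Delphi compiler (or Free Pascal in Delphi mode), and the variable names are made up for the example.

```pascal
program UniquePChar;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
var
  Original, Work: string;
  P: PChar;
  I: Integer;
begin
  Original := StringOfChar('a', 5);
  Work := Original;            // shared buffer at this point
  UniqueString(Work);          // one explicit check (and copy, since it was shared)
  P := PChar(Work);            // raw pointer into Work's private buffer
  for I := 0 to Length(Work) - 1 do
    P[I] := 'b';               // plain memory writes, no per-character COW check
  Writeln(Original);           // aaaaa
  Writeln(Work);               // bbbbb
end.
```

The pointer is only valid as long as Work is neither reassigned nor resized, so this is best kept to tight, local loops.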
Copy-on-write can be done under a GC, but it means you have to maintain a reference count or similar information yourself, since the GC doesn’t have one. Android relies a lot on copy-on-write, and that was actually one key difference between the Dalvik VM and more classic Java VMs.
Advantages of RawByteString & UTF8String over TBytes
And this will be a bit more controversial, but copy-on-write is also why RawByteString/UTF8String can ofttimes make a lot more sense than TBytes for binary buffers: RawByteString isn’t just reference-counted (like TBytes), it also supports copy-on-write.
This means that in a multi-threaded environment, RawByteString enjoys the same immutability advantages as String, which TBytes doesn’t, as TBytes is always mutable.
String wraps up the advantages of both Java/.NET String and StringBuilder: it has both the multi-threading advantages of immutability and the capability of mutability.
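The difference shows up even in a single thread. The following sketch assumes a desktop Delphi compiler (or Free Pascal in Delphi mode) where RawByteString is available; the variable names are illustrative.

```pascal
program BytesVsRaw;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  SysUtils;                    // declares TBytes
var
  B1, B2: TBytes;
  R1, R2: RawByteString;
begin
  // TBytes: reference-counted, but without copy-on-write
  SetLength(B1, 1);
  B1[0] := 1;
  B2 := B1;                    // both variables refer to the same array
  B2[0] := 2;                  // writes through to the shared array
  Writeln(B1[0]);              // 2, B1 was modified behind its back

  // RawByteString: reference-counted with copy-on-write
  SetLength(R1, 1);
  R1[1] := #1;
  R2 := R1;                    // shared buffer
  R2[1] := #2;                 // R2 is copied before the write
  Writeln(Ord(R1[1]));         // 1, R1 is unaffected
end.
```

With TBytes, getting the same isolation requires an explicit Copy() at every hand-off, which is exactly the duplication that copy-on-write avoids.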
Performance-wise, under a speculative memory manager (like most modern allocators), you’ll also find that merely concatenating to a String is typically just as fast as using TStringBuilder, and in several cases it’s actually faster, because String benefits from compiler magic while TStringBuilder does not (also, some TStringBuilder implementations are a little weak).
Alas, some String performance was lost during the Unicode and 64-bit transitions, when some FastCode routines were replaced by lower-performing pure-Pascal ones, and you’ll lose even more performance with TStringHelper, which introduces some algorithmically poor pure-Pascal implementations.