Yes, a somewhat separate topic I think. The blog covers a lot of ground, but equally leaves us a lot of ground to experiment with. For example, it’s not clear what happens when you assign a STRING or CSTRING into a USTRING. Or perhaps more interestingly what happens when you assign a USTRING into a STRING etc. Presumably it makes use of system{prop:codepage} - but we’ll have to wait and see.
Apart from the obvious u' syntax, it’s not clear how the text inside the string should be encoded for non-ANSI characters. As you say, that could be pasted from the clipboard, or in some cases typed on the keyboard (if you have a foreign keyboard). Is it tenable to have ANSI CLW files and still support Unicode characters (UTF-8 or UTF-16, either way) in hard-coded strings?
Equally, how do characters work in a string assignment? As in
us = '<65,66,67,68,69>'
us = u'<65,0,66,0,67,0,68,0,69,0>'
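Whatever syntax the compiler ends up shipping, the byte pattern in the second example is just the UTF-16 little-endian encoding of the first: each ANSI character becomes a 2-byte code unit with a trailing zero high byte. A quick sketch of that fact (in Python, purely as an illustration of the encoding, not of Clarion syntax):

```python
# "ABCDE" as single-byte ANSI character codes...
ansi = bytes([65, 66, 67, 68, 69])
print(list(ansi))                      # [65, 66, 67, 68, 69]

# ...versus the same text as UTF-16 little-endian bytes:
# each character becomes two bytes, low byte first, high byte 0.
utf16le = 'ABCDE'.encode('utf-16-le')
print(list(utf16le))                   # [65, 0, 66, 0, 67, 0, 68, 0, 69, 0]
```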
Frankly, I think a lot of this will become evident once 12.1 (or whatever it’s called) ships. And I expect we can make some doc threads here to cover these situations.
Personally, I think SIZE should always report the amount of memory used by the variable, not the character length of the string in this one special case.
Instead, I think that we should overload the LEN operator with a second parameter:
UStr    USTRING(20)
Memory  LONG
ActLen  LONG
ClpLen  LONG
MaxLen  LONG
  CODE
  UStr = 'String Value'
  Memory = SIZE(UStr)           !40  How much memory? (20 chars x 2 bytes)
  ActLen = LEN(UStr)            !20? Traditional (trailing spaces in USTRING?)
  ClpLen = LEN(UStr, LEN:Clip)  !12  Like LEN(CLIP(UStr))
  MaxLen = LEN(UStr, LEN:Max)   !20  Max chars, like LEN(ClaStr)
Of course, one would also need to search one’s existing code to find places that have used SIZE with string parsing, and change those to LEN(S, LEN:Max).
I agree. Otherwise we end up with SIZE meaning different things depending on where it’s used. Consistency really requires it to return the number of bytes allocated, regardless of the variable type.
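The bytes-allocated vs. characters distinction that the SIZE/LEN proposal hinges on is easy to see in any language with explicit encoding. A small illustration (Python, not Clarion; the comments map each value onto the proposed Clarion calls):

```python
s = 'String Value'

# Character count of the clipped value -> like the proposed LEN(UStr, LEN:Clip)
print(len(s))                        # 12

# Bytes actually occupied when stored as UTF-16 (2 bytes per BMP character)
# -> what a consistent SIZE() would report for the used portion.
print(len(s.encode('utf-16-le')))    # 24
```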
The current Help file already contains the Unicode material that has not yet been released. It says the compiler supports source files encoded as UTF-8 or UTF-16.
The C11 Clarion compiler supports source and include files in UTF-16 (little endian) and UTF-8 encoding to allow Unicode string literals without the necessity to use explicit character codes inside the <;> meta-symbols
The new compiler simply must support Unicode source files.
The current Help file’s Unicode material, under String Constants, says:
Unicode string literals can include the same { } and <;> meta-symbols as their ANSI equivalents. Numbers listed between < and > meta-symbols are treated as 16-bit wide character codes.
Between < and > you can have a 16-bit value, i.e. decimal values 0 to 65535. In your example for <decimal> I think that simply removes the “,0”. For hex I assume it will allow a 4-digit hex value, but in little endian so it matches the internal hex exactly:
us = u'ABCDE'
us = u'<65,66,67,68,69>'                 ! Decimal of ABCDE
us = u'<4100h,4200h,4300h,4400h,4500h>'  ! Hex
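Whether or not the hex form ends up byte-swapped, the underlying UTF-16 code units for 'ABCDE' are simply 0041h..0045h (decimal 65..69); only the bytes on disk are little-endian, which is where the “4100” reading comes from. A quick check of those facts (Python illustration, not Clarion syntax):

```python
# UTF-16 code units of 'ABCDE' are just the character codes 65..69.
units = [ord(c) for c in 'ABCDE']
print(units)                            # [65, 66, 67, 68, 69]
print([f'{u:04X}' for u in units])      # ['0041', '0042', '0043', '0044', '0045']

# The little-endian *bytes* of the first code unit read as "4100" byte-wise.
print('ABCDE'.encode('utf-16-le')[:2].hex())  # '4100'
```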
The Euro Sign is U+20AC, or 8,364 decimal. I assume that page is big endian, so for Windows:
euro = u'<8364>'
euro = u'<0AC20h>' !Little Endian of U+20AC - ? wrong see below ?
It would be nice to have a big-endian hex format like the common U+ or \u notations, e.g.:
euro = u'<U+20AC>' !Idea of Big Endian
euro = u'<\u20AC>' !Idea for Big Endian
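The endianness question above is easy to pin down concretely: the code point value is 20ACh (8364), the little-endian bytes in memory are AC 20, and the big-endian byte order matches the familiar “U+20AC” reading. Sketched in Python (as a check of the Unicode facts, not of any Clarion syntax):

```python
euro = '\u20AC'
print(ord(euro))                        # 8364, i.e. U+20AC

# Bytes as stored by Windows-style UTF-16LE: AC 20 (low byte first).
print(euro.encode('utf-16-le').hex())   # 'ac20'

# Big-endian byte order matches the "U+20AC" spelling directly.
print(euro.encode('utf-16-be').hex())   # '20ac'
```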
Edit 1/10/26
I’m probably wrong that a Unicode string <16-bit hex> is flipped to little endian. When you write a constant hex number like LONG(20ACh) it’s written big endian (most significant digits first), so I would expect the same here, e.g. u'<20ACh>'.
It would be nice to have a syntax for UTF-32 values that is simpler than the pair of UTF-16 surrogates. E.g. a smiling emoji in C# can be coded as '\U0001F60A', but in UTF-16 it would be u'<0D83Dh><0DE0Ah>'.
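For reference, the surrogate pair for a supplementary-plane code point is derived mechanically: subtract 10000h, then the top 10 bits go into the high surrogate (D800h base) and the low 10 bits into the low surrogate (DC00h base). Checking that U+1F60A really does become D83D DE0A (Python illustration of the UTF-16 algorithm, not Clarion):

```python
# U+1F60A (smiling face with smiling eyes) is outside the BMP,
# so UTF-16 must represent it as a surrogate pair.
cp = 0x1F60A
hi = 0xD800 + ((cp - 0x10000) >> 10)     # high (lead) surrogate
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)   # low (trail) surrogate
print(f'{hi:04X} {lo:04X}')              # D83D DE0A

# Cross-check against a real UTF-16 encoder (big-endian for readability):
print(chr(cp).encode('utf-16-be').hex())  # 'd83dde0a'
```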