What Z is saying is that a USTRING always stores Unicode.
If you assign plain ASCII text to it, there is nothing to “convert” in any meaningful way because ASCII characters map 1:1 into Unicode, so it will look identical when you view it.
The part that can trip people up is not the USTRING, it is the source literal.
If your source file is ANSI and you write:
MyUnicode = '…'
that literal is treated as ANSI bytes. When Clarion moves it into a Unicode context (like a USTRING), it has to interpret those bytes using whatever codepage rules are in effect. For ASCII, it is safe and will always round-trip. For anything above 127, the result can depend on the codepage.
If you want to be explicit and avoid ambiguity for non-ASCII characters, use a Unicode literal or a conversion, for example:
MyUnicode = u'…'
or TOUNICODE() on an ANSI string.
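Something like this is what I mean (a sketch only; none of this is testable until the beta arrives, and I am assuming USTRING declares like STRING with a size and that TOUNICODE() takes an ANSI string):

MyAnsi     CSTRING('Café')            ! ANSI source text; the é is where the codepage matters
MyUnicode  USTRING(100)               ! assumed declaration form for the new Unicode type
  CODE
  MyUnicode = u'Café'                 ! compiler converts the literal to Unicode, no runtime guessing
  MyUnicode = TOUNICODE(MyAnsi)       ! explicit runtime conversion of ANSI text to Unicode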
So yes, your sample will hold exactly the same text you typed, but stored as Unicode, and the “what does it contain” uncertainty really only shows up when the input is not pure ASCII and you do not make the literal explicitly Unicode.
But then again I don’t even pretend to know what goes on in the mind of SoftVelocity so this is just how I am reading the room!
So the two assignments are different. One is ANSI, one isn't. If I'm manipulating and building strings at runtime, how am I meant to know whether a USTRING contains ANSI text or not?
The following are the rules the compiler uses to determine how to handle string literals:
If the source file is encoded as ANSI, string literals without the U specifier before the apostrophe are taken as-is. Unicode string literals with U before the apostrophe are converted by the compiler to Unicode using the codepage value set by the pragma define(codepage=>n).
If the source file has UTF-8 or UTF-16 encoding, Unicode string literals are taken as-is. ANSI string literals without U before the apostrophe are converted by the compiler to ANSI using the codepage value set by the pragma define(codepage=>n).
The default value for the codepage (when not specified by the pragma define(codepage=>n)) used by the compiler for ANSI<->Unicode conversions is CP_ACP.
We currently have BSTRING, which is UTF-16 and converts ANSI text using the current codepage. There is currently no option to have CLW files encoded as UTF-8 or UTF-16, only ANSI.
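If I am reading those rules right, it boils down to something like this in an ANSI-encoded CLW (whether the codepage is set as a project define or with a PRAGMA statement in source is my assumption; the name of the define comes from the rules above):

  PRAGMA('define(codepage=>1252)')        ! assumed placement; tells the compiler which ANSI codepage to use
MyUnicode  USTRING(100)
  CODE
  MyUnicode = 'taken as ANSI bytes'       ! no u prefix in an ANSI file: the literal stays ANSI and is converted when it lands in the USTRING
  MyUnicode = u'converted to Unicode'     ! u prefix: the compiler converts the literal using codepage 1252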
What you are calling “ANSI text inside a USTRING” is really just “ASCII characters stored in Unicode”. A USTRING is UTF-16 storage. Once the value is in the USTRING, it is Unicode.
There is no flag in the variable that says “this came from ANSI” vs “this came from a Unicode literal”.
The difference between:
MyUnicode = 'Hi, I''m ANSI, how are you?'
MyUnicode = u'Hi, I''m ANSI, how are you?'
is about how the compiler interprets the literal before it ever gets assigned.
In an ANSI-encoded source file, a non-U literal is taken as raw ANSI bytes, and a U literal is converted by the compiler to Unicode using the codepage from pragma define(codepage=>n) (default CP_ACP if you do not set it).
In a UTF-8 or UTF-16 encoded source file, Unicode literals are taken as-is, and non-U literals are converted to ANSI using that same codepage setting.
At runtime, you handle it the same way you handle any other encoding boundary:
Decide what your internal representation is (USTRING as Unicode is the obvious choice).
Convert at the edges, once, when data enters or leaves (files, sockets, API calls, legacy STRING/CSTRING sources, etc.).
If you have an ANSI STRING/CSTRING at runtime and you want it in Unicode, explicitly convert it (or force a Unicode expression by concatenating with a USTRING), and from that point forward, treat it as Unicode.
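For instance, a sketch of that last point (assuming a plain assignment from CSTRING to USTRING does the widening, which is how I read the rules quoted earlier):

LegacyName  CSTRING(61)                 ! ANSI text arriving from a legacy file or API
WorkName    USTRING(100)                ! internal representation: Unicode
  CODE
  LegacyName = 'Müller'                 ! ANSI bytes in whatever codepage the source uses
  WorkName   = LegacyName               ! convert once, here at the boundary
  ! from this point on, only WorkName (Unicode) is used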
So the real answer to “how do I know what it contains” is: you cannot detect the original provenance from the USTRING itself, and you should not try.
You ensure correctness by controlling where conversion happens and by knowing the encoding of the input source.
If the input bytes might be UTF-8, OEM, ACP, etc., that has always been an application-level responsibility, not something a string type can magically infer.
I think the specific example you gave Lee was a good one but I have no idea what the answer is and I don’t think you have had an answer yet?
Is this correct? In your templates you have a "PageOfPages" functionality in which the developer can enter a search string to look for at runtime. If the developer enters a Unicode or ANSI constant as that search string, how do you find it at runtime?
Firstly, I assume the templates will support Unicode data types??
Secondly, I am guessing that you will be able to find whatever is entered into that Unicode template variable exactly the same as you do now, without having to wonder if the search string contains ANSI and/or Unicode?
PS: I know absolutely nothing about this so I’m just trying to get clarity for myself.
Yes, your guess is basically right, and the way to think about it is simpler than people expect.
If the template collects a search string at runtime, there is no such thing as “ANSI or Unicode inside the USTRING” once it is in a Unicode variable. The input becomes characters in Unicode storage. The only time “ANSI vs Unicode” matters is at the boundary: what type the variable is, what the control is bound to, and what conversions happen when you move data in or out.
So for your PageOfPages example:
If the template uses USTRING (or otherwise forces a Unicode expression) for the search string, then searching works the same way as today. You search for characters, not for “ANSI bytes”. It does not need a special “is this ANSI” check.
If the input comes from an ENTRY control bound to a USTRING, the user can type Unicode, and your runtime search can match Unicode, assuming the target text you are searching through is also Unicode in memory.
The one thing you do have to be consistent about is representation. If you search a Unicode buffer with an ANSI search string, Clarion will convert one side to match the other based on codepage rules. That is where surprises can happen for characters above 127. The fix is simple: make both sides Unicode before you compare.
So what we are looking for is:
The templates should support Unicode types for any user-facing text or search strings.
Then at runtime, we do not need to wonder whether the search string is ANSI or Unicode. You decide the variable type. If it is USTRING, it is Unicode.
After that just make sure the text you are searching and the search term are in the same encoding (preferably both Unicode) before doing the find.
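In code terms I picture it like this (assuming INSTRING and CLIP will accept USTRING arguments, which is exactly the kind of thing we will only know when the beta lands):

SearchFor  USTRING(100)                 ! the template's search string, entered at runtime
PageText   USTRING(4000)                ! the text being searched, already Unicode in memory
FoundAt    LONG
  CODE
  FoundAt = INSTRING(CLIP(SearchFor), PageText, 1, 1)   ! both sides Unicode, so no codepage guessing
  IF FoundAt
    ! match found at character position FoundAt
  END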
My gut says that he may be treating "contains" here differently to you (and I agree it's confusing).
I’d recommend we put a pin in this and simply explore it when it ships. I suspect most of the speculation here is “somewhat incorrect”.
Personally I speculate that it behaves as you desire, Lee (i.e. the USTRING always contains UTF-16). I suspect the assignments will be straightforward (u' means take as-is, no u' means ANSI to UTF-16 conversion).
This is an oversimplification, Charles. Yes, the mapping for ASCII (not ANSI) is the same for Unicode. But the encoding (in UTF-16) is not the same. It's an easy conversion, yes, but it's still a conversion (under the hood).
For UTF-8 the encoding of ASCII would be the same, but UTF-8 is not in play here.
The characters map 1:1 for ASCII, but when you move an ANSI/ASCII string into a USTRING the runtime still widens it to UTF-16 code units, so the bytes in memory are different.
My point was just that it is a lossless conversion for 0-127, and the real gotchas start when bytes above 127 are involved and codepage rules come into play.
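To make the "same characters, different bytes" point concrete (the byte layout is my assumption of what UTF-16LE storage would look like; the declarations are illustrative only):

AnsiHi  STRING('Hi')        ! 2 bytes in memory: 48h 69h
WideHi  USTRING(10)
  CODE
  WideHi = AnsiHi           ! presumably stored as UTF-16LE: 48h 00h 69h 00h - same characters, twice the bytes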
I was coming to make that point, that the conversion to UTF-16 only matters for characters 128-255 when the Windows Code Page (Microsoft link) kicks in. For most of us in the USA this will be rare as the default code page 1252 covers everything we need. A few quotes from that MS page:
Windows code pages, commonly called “ANSI code pages”, are code pages for which non-ASCII values (values greater than 127) represent international characters.
For both Windows code pages and OEM code pages, the code values 0x00 through 0x7F correspond to the 7-bit ASCII character set. Code values 0x00 through 0x19 and 0x7F always represent standardized control characters and 0x20 through 0x7E represent standardized displayable characters. Characters represented by the remaining codes, 0x80 through 0xff, vary among character sets. Each character set includes different special characters, typically customized for a language or group of languages. Windows code page 1252 and OEM code page 437 are generally used in the United States.
An accidental paste of that code from here (not in a code block), Word, email, etc., may end up with typesetter quotes that are above 127 (91h, 92h, 93h, 94h) or the Unicode characters U+2018, U+2019, U+201C and U+201D.
Maybe it's just my understanding, but "U"STRING means Unicode, so the only thing it should contain, unless otherwise indicated for whatever reason, is Unicode.
Consider attempting to concatenate two USTRINGs where one is Unicode and the other is ANSI. What you end up with is nonsense.
This could easily be done by omission of a "u" or "U" followed by ANSI text… the contents would insert un-expanded ANSI into a Unicode variable.
To my way of thinking, MyUstring = 'ANSI' should, in fact, expand that to include nulls in the form of Unicode data. If the developer wants to hazard pushing ANSI into a USTRING, sure, let them get lost and waste time attempting to find the error of their ways.
Carl, great example. This is where Unicode becomes less about storage and more about semantics.
A USTRING will store whatever characters were entered, including curly quotes from a paste. If the user later types a search term with straight quotes, those are different characters even though they look similar, so an exact compare or FIND will miss.
So the “problem” is not that the USTRING contains some mix of ANSI and Unicode. It’s that human text can contain different code points for visually similar punctuation (straight vs curly quotes, non-breaking spaces, ellipsis, dash variants, etc.).
If you want searches to behave the way users expect, you have to pick a matching policy. A common approach is to normalize for search by folding curly quotes to straight quotes (and a few other common substitutions), then compare on that normalized form.
For small datasets you can normalize on the fly during comparisons. For large datasets where you need speed and indexing, it usually makes sense to store a separate normalized “search key” field when the data is saved, and search against that, while keeping the original text unchanged for display.
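As a very rough sketch of that normalize idea, here is a version that works on today's ANSI strings, where the CP1252 typesetter quotes sit at 91h-94h as Carl noted (the Unicode equivalents U+2018/U+2019/U+201C/U+201D would get the same treatment once we can see how USTRING indexing behaves; the MAP prototype is omitted for brevity):

NormalizeQuotes PROCEDURE(STRING pText),STRING     ! fold typesetter quotes to ASCII for searching
i       LONG
Work    STRING(1000)
  CODE
  Work = pText
  LOOP i = 1 TO LEN(CLIP(Work))
    CASE VAL(Work[i])
    OF 091h OROF 092h            ! left/right single typesetter quote -> apostrophe
      Work[i] = ''''
    OF 093h OROF 094h            ! left/right double typesetter quote -> straight double quote
      Work[i] = '"'
    END
  END
  RETURN CLIP(Work)

Compare NormalizeQuotes() of the text being searched against NormalizeQuotes() of the search term and the curly/straight difference disappears, while the stored text stays untouched for display.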
When an ANSI literal or STRING flows into a USTRING context (assignment or concatenation), Clarion widens and converts it using the active codepage. The only time you get garbage is when the wrong codepage is used, or the source bytes are not actually in that codepage (common with UTF-8 misread as ANSI), or when you force Unicode back into an ANSI result.
What happens in mixed expressions depends on the type of the result you store into:
Example 1, Unicode result (safe, preserves characters):
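Something along these lines, I would guess (MyU1 is the name referred to below; the other declarations are mine):

MyU1      USTRING(100)
MyAnsi    CSTRING('plain ANSI text')
MyResult  USTRING(250)
  CODE
  MyU1     = u'Ω mega'                   ! contains a character with no CP1252 equivalent
  MyResult = MyU1 & ' / ' & MyAnsi       ! Unicode result: the ANSI piece is widened, nothing is lost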
Example 2, ANSI result (lossy for characters outside the codepage): the whole expression is forced to ANSI for the assignment. MyU1 is down-converted to the current codepage. Any character not representable may turn into '?' or a substitute. That is expected behavior.
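In code, that second case would look something like this (same assumed declarations as above):

MyOut  CSTRING(250)
  CODE
  MyOut = MyU1 & ' / ' & MyAnsi          ! ANSI result: MyU1 is down-converted, the Ω likely comes back as '?'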
Bottom line: USTRING is Unicode. Mixing ANSI and Unicode is fine as long as you are intentional about where conversion happens, and you remember that converting Unicode down to ANSI can be lossy.
When you asked Z your question, you typed or pasted a literal that contains characters above ASCII (curly quotes), but he might have missed that.
Without the u prefix the compiler does not necessarily treat that literal as Unicode. In an ANSI source file, a plain '…' literal is interpreted via the ANSI/codepage path first, and then it is widened into UTF-16 for storage in the USTRING.
So even though the text in your editor shows curly quotes, what actually gets compiled into the program can differ if those characters do not survive the ANSI/codepage interpretation step unchanged. (For pure ASCII it is always lossless, but above 127 it is codepage-dependent.)
That is why Z can say "it comes back exactly the same" for the case he is assuming (ASCII or codepage-safe content), and why the u'…' form matters when you want to guarantee the literal is taken as true Unicode characters as written.
The USTRING is still UTF-16 storage either way.
The difference is whether the literal is treated as Unicode at compile time (u'…') or treated as ANSI bytes first ('…' in an ANSI source file) and only then converted.
I think the correct answer is that nobody knows anything until mid-January hits. I mean, until the beta is released and people examine it themselves.