Just some topics I'd like to discuss

I’m uncertain if this is permissible so if it isn’t then feel free to remove this post.

I know that a lot of you are still wondering what’s going on with Unicode in Clarion. I have a few thoughts on that and felt the need to say “something!”

Even back in CPD there was one aspect of Clarion that, basically, was the redheaded stepchild that was always last to be handled: REPORTS! Think back, what was the last thing added to ABC? Yep, reports.

From my viewpoint there is only ONE reason Unicode is yet to be available. Take a wild guess, yes - once again, reports.

WMFs do not support Unicode. EMFs do. In fact, if you go by the standards created by MS, all TextOut commands are to be written as Unicode, not ANSI.
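
A minimal Win32 C sketch (not Clarion code) of that point: text recorded into an enhanced metafile via ExtTextOutW is stored as UTF-16, so Hebrew, Cyrillic, CJK and so on survive intact. The file name and the sample text are just illustrative.

    #include <windows.h>
    #include <wchar.h>

    int main(void)
    {
        /* lpDesc format is "App\0Description\0" - the literal's own terminator supplies the double null */
        HDC hdc = CreateEnhMetaFileW(NULL, L"sample.emf", NULL, L"Demo\0Unicode text\0");
        if (hdc == NULL) return 1;

        const wchar_t *text = L"Shalom \u05E9\u05DC\u05D5\u05DD";          /* mixed Latin + Hebrew     */
        ExtTextOutW(hdc, 10, 10, 0, NULL, text, (UINT)wcslen(text), NULL); /* recorded as UTF-16 text  */

        DeleteEnhMetaFile(CloseEnhMetaFile(hdc));  /* demo only; normally you'd keep or play the metafile */
        return 0;
    }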

Until the reports can handle Unicode the remainder cannot!

Of course this brings back thoughts I’ve had about Unicode support since the subject was first brought up. Many will argue about my concept but if you truly ponder on it I’m thinking it’s the correct path for the future of Clarion.

EVERYTHING in the RTL should be rewritten to use Unicode, PERIOD! Not a bit here and a bit there but EVERYTHING! In other words, what’s behind the user interface and behind the scenes within a compiled program should be Unicode and ONLY Unicode. This limits a lot of the potential problems in supporting ANSI and Unicode… make it ALL Unicode.

If you need to access ANSI data, add a simple attribute to THAT file structure and the RTL converts the data, at runtime, into Unicode for internal use. Need to save that data? The RTL converts it back to ANSI because of that attribute.
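
A rough sketch, in plain Win32 C rather than anything SoftVelocity has published, of the round trip such an attribute implies: ANSI bytes are widened on read and narrowed again on write. CP_ACP is the process ANSI code page; the function names here are made up.

    #include <windows.h>

    /* ANSI bytes from disk -> UTF-16 for internal use */
    int AnsiToWide(const char *ansi, wchar_t *wide, int wideChars)
    {
        return MultiByteToWideChar(CP_ACP, 0, ansi, -1, wide, wideChars);
    }

    /* UTF-16 back to ANSI just before the record is written */
    int WideToAnsi(const wchar_t *wide, char *ansi, int ansiBytes)
    {
        return WideCharToMultiByte(CP_ACP, 0, wide, -1, ansi, ansiBytes, NULL, NULL);
    }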

Nothing else would need to be altered since, from the RTL’s point of view, it’s all Unicode. No need for any new data types; STRINGs and CSTRINGs would simply become Unicode strings and cstrings INTERNALLY!

And, since everything internally is Unicode, all that would be needed to get it up and running would be changing to EMFs.

Just my thoughts. I’ll crawl back under my rock now.

Have a great day!


I thought the C9 RTL introduced EMF reports as per this blog post.

Afraid not. The WMFs in Clarion are simply Aldus-header WMFs (placeable metafiles), not EMFs.


Alas, unfortunately Lee, it’s not that simple.

For the moment I’m going to stick a pin in the decision of what Unicode encoding your proposed strings would use. My preference would be UTF-8, but since the Windows APIs use UTF-16, and there’s every indication that Clarion would use UTF-16, we’ll assume it’s UTF-16.

Changing CSTRINGs to UTF-16 would be tricky. It would break a lot of existing code, since UTF-16 strings contain lots of null bytes and are terminated with null-null.
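
A tiny illustration of the null problem, assuming Windows where wchar_t is a 16-bit UTF-16 code unit: byte-oriented, single-null-terminated code (the CSTRING mindset) stops at the first zero byte.

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* "ABC" in UTF-16LE is the bytes 41 00 42 00 43 00, terminated by 00 00 */
        const wchar_t *w = L"ABC";
        /* a byte scanner looking for a single null gives up after the first character */
        printf("%u\n", (unsigned)strlen((const char *)w));   /* prints 1 on a little-endian Windows box */
        return 0;
    }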

Clarion programmers use Strings in two distinct ways: as “strings of text” but also as “blocks of memory”. We don’t really distinguish between the two now, but being able to tell those use cases apart will be necessary in the future.

Simply converting all Strings to Utf16Strings would also break a lot of (almost all) existing hand/embed code. Currently we treat SIZE and LEN as interchangeable; we talk about “the number of characters we can store in the string”, there’s string slicing and so on.
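
A short sketch of how “size” and “length” stop agreeing once a character can need more than one code unit; it assumes Windows/MSVC, where wchar_t is UTF-16. The emoji needs a surrogate pair, so the byte count, the code-unit count and the characters the user sees are three different numbers.

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        const wchar_t s[] = L"A\U0001F600";   /* 'A' plus an emoji that needs a surrogate pair */
        printf("SIZE-ish (bytes)     : %u\n", (unsigned)(sizeof(s) - sizeof(wchar_t)));  /* 6 */
        printf("LEN-ish (code units) : %u\n", (unsigned)wcslen(s));                      /* 3 */
        /* characters the user sees: 2 (and nothing above reports that number) */
        return 0;
    }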

In other words our very understanding of a “string” needs to change as we adopt UTF-16 or UTF-8 as a storage type.

At a fundamental level these are all different data types. The language needs to reflect that. They’re different in the ways that REAL and DECIMAL are different, useful for different things.

Being able to tell one from another is critical in communications, APIs, CSV, JSON, XML, plus every communications protocol.

Just changing the STRING type under the hood would basically break all but the most trivial of programs. A Clarion version that did that would simply be ignored by the community.

Incidentally your approach has been tried. Python 3 did what you suggested, causing Python 2 programs to break. There is a lot of interesting history online regarding this transition.


Bruce,

We can disagree all day long but I think you might not have understood what I was proposing.

Regardless of how many bytes define a character, that character is still, at the RTL level, one character. LEN(), being a runtime function, would already KNOW it’s not ANSI and would behave exactly as it does today. As an example, a string with 4 characters would return 4 regardless of how many bytes are being held. Even if the data read into that string is ANSI, from the user’s POV the RTL would see it and handle it as Unicode. IE: the RTL needs to be written to disregard the number of bytes in a single character.

If you want to insert 4 characters at the 10th character within a string, the RTL knows how to find the actual position that character occupies in the Unicode data.
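
A hedged sketch, in C, of the bookkeeping that implies: translating a 1-based character position into a UTF-16 code-unit offset by stepping over surrogate pairs. The function name is invented, and it counts code points rather than user-perceived characters.

    #include <stddef.h>

    /* returns the code-unit index where the Nth character (1-based) starts, or -1 if out of range */
    ptrdiff_t UnitIndexOfChar(const wchar_t *s, size_t lenUnits, size_t charPos)
    {
        size_t unit = 0, ch = 0;
        while (unit < lenUnits)
        {
            if (++ch == charPos) return (ptrdiff_t)unit;
            /* a high surrogate means this character occupies two code units */
            unit += (s[unit] >= 0xD800 && s[unit] <= 0xDBFF) ? 2 : 1;
        }
        return -1;
    }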

IOW, if you use this code…

MyString STRING(80)

It would, from the user’s POV, and yours, still be a string of 80 CHARACTERS regardless of how many bytes it takes to define that string in Unicode. It’s the RTL that handles all the boring stuff, just like it already does; the only change would be that internally, to the RTL, it’s Unicode.

Even a file structure that holds ANSI data would continue to hold ANSI. When a string is read from that file, because of a new attribute, the RTL would convert it, internally, to Unicode, and when you save that record, again due to the attribute, it converts it BACK to ANSI.

I’m very familiar with double nulls and the RTL would also. It’s what the RTL is supposed to do so we don’t have to write everything in machine code!

SV has a history of biting off more than they can chew and has a problem with visions of grandeur that seem to always fall short or never come to fruition. A perfect example: ClarionSharp!

I argued for months to NOT change the syntax, since doing so would create a language that is no longer Clarion. Case in point: making everything ZERO-based instead of the true Clarion default of ONE-based.

Logic such as IF NOT INSTRING() would crash and burn so a massive rewrite would have been necessary for any existing code. Bad idea since that just became yet another cookie-cutter dot net language… as if there weren’t already enough of them.

I yelled about it but was overruled by a few names bigger than mine.

Anyone using ClarionSharp for production these days? None that I’m aware of.

The current buggy IDE was adopted because SV thought the future was dot Net, and now we have problems simply aligning prompts and entry controls in the window designer.

From my point of view, and I realize I’m just an old fart that belches a lot, that move was seriously wrong. But my reasons were ignored then as they are now.

There are a few parts of the IDE that I did influence and I’m kinda happy that I did.

The original used drop-downs for T/F choices, i.e. for what should be a freak’n checkbox. Also, if you’re in the window formatter you can select a control, PRESS ENTER, and it takes you to that control’s properties; once you make your changes you can ESC the focus back to the designer to continue. Granted, I’ve been labelled “Mr. Keyboard”, but I began this Windows venture adhering to MS’s own SWB. That, if anyone recalls, is to use a pointing device to move, resize or draw something, and ALL other uses were to be available on the keyboard. A logical decision that they later decided to toss in the trash.

In no way am I saying EVERYTHING could remain “as is”, but the vast majority could be if the RTL were rewritten to use only Unicode internally.

(crawling back under my rock)


As an example

MyString STRING(80)

Are you stuck in thinking that’s 80 bytes or 80 characters?

I consider it 80 characters, since in Unicode those characters could be composed of 2, 3, or 4 bytes. Since they can vary, how exactly do you define a Unicode string length when you have no prior knowledge of how many bytes each character requires?

My email client uses UTF-8. Some of the emoticons I use are single byte, some are 2 bytes and some are 3 bytes.

Don’t waste time attempting to count bytes; count characters. And if the RTL handles it correctly then 80 characters can be held within MyString regardless of how many bytes it takes.
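
For what it’s worth, counting characters rather than bytes in UTF-8 is a small loop: every byte that is not a continuation byte (10xxxxxx) starts a new code point. A minimal sketch (code points, not grapheme clusters):

    #include <stddef.h>

    size_t Utf8CodePoints(const char *s)
    {
        size_t n = 0;
        for (; *s; ++s)
            if (((unsigned char)*s & 0xC0) != 0x80)   /* skip 10xxxxxx continuation bytes */
                ++n;
        return n;
    }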

Bytes are numeric data, not character counts.

(back under my rock)


Great discussion, and I’m now chatting to our chief to get a better understanding of UTF-8 as it’s what we use all the time… AI-generated code to support counting Unicode in CCP…

    int count = 0;                                   /* code points counted so far                */
    char *ptr = (char*)scriptStringAsString(astr);   /* raw bytes of the script string            */
    while (scriptStringNextUc4(&ptr) != -1)          /* advance one code point; -1 signals the end */
        count++;

It’s an interesting discussion, Lee, but unfortunately the devil is in the detail.

Firstly, Unicode does not deal in characters but in code points. Up to 6 code points can be used to form a character, so a single character could take 12 bytes.
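
A small illustration of that, assuming Windows where wchar_t is a 16-bit UTF-16 unit: “e” followed by a combining acute accent is one character to the user but two code points (four bytes) in storage, and more combining marks keep growing the byte count.

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        const wchar_t one_char[] = L"e\u0301";   /* 'e' + COMBINING ACUTE ACCENT, seen by the user as one letter */
        printf("code units: %u\n", (unsigned)wcslen(one_char));                      /* 2 */
        printf("bytes     : %u\n", (unsigned)(wcslen(one_char) * sizeof(wchar_t)));  /* 4 on Windows */
        /* user-perceived characters: 1 */
        return 0;
    }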

Declaring a STRING(80) as a string that can hold 80 characters would then mean a buffer 960 bytes of RAM long.

In other words, when using Unicode we let go of the concept of “number of characters that can fit in this string”.

My point is that LEN and SIZE being the same in ANSI means a lot of existing code is “incorrect”. So apart from performance issues (in Unicode SIZE is fast, it just returns the memory size; LEN is slow since it needs to parse the whole string), they would start returning different numbers, which in turn would break a lot of existing code.
As an aside, there’s no language consensus on what LEN actually should return; Clarion would need to pick one. Google that for more fun.

But wait, there’s more. Strings can be used as “raw memory”. That’s what OVER is for. That’s what string slicing is for. Both of those would be irreparably broken.
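
A rough C analogue of that “raw memory” usage: a fixed-layout record addressed by byte offsets, much as OVER and slicing are used today. The layout here is invented; the point is that the offsets are byte offsets and stop meaning anything if the underlying storage silently becomes UTF-16.

    #include <stdint.h>
    #include <string.h>

    typedef struct { char raw[32]; } Record;   /* think STRING(32) used as a buffer */

    uint32_t ReadIdField(const Record *r)
    {
        uint32_t id;
        memcpy(&id, r->raw + 4, sizeof id);    /* bytes 5..8 hold a little-endian LONG */
        return id;
    }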

While I appreciate things can be done inside the RTL, most code is not in the RTL. The RTL does not implement protocols, does not import or export, does not facilitate communications, all of which require byte-level control.

Again I refer you to Python, which tried the path you are proposing. They did that in 2008, and official support (meaning updates) to Python 2 ended in 2020. Maintaining C11 for another 12 years seems unlikely.

I post this less to argue with you than to point out that, as simple as it seems, it’s not a solution that would work for existing programs.

With SharpDevelop, SV was able to acquire an IDE with the modern features seen in Visual Studio, as that was the goal of that project: pads that tear off and can be snapped/positioned. The C6 IDE was very modal. Using MSBuild to compile has its advantages.

Overall I like the new IDE; it is likely better than a port of the old IDE to 32-bit would have been. Sure, it has its awkward and modal bits in making it work with APPs and Templates. The Screen Designer was made for WinForms so it has some problems, but it works well for me every day.

This was not the main point of your topic; I just wanted to say to Z that I’m pleased with the IDE. I doubt a port of the 16-bit modal IDE would be a product that they could sell.


On your main point… maybe a way to view where we are is that it’s like being near the start of Unicode with C++. There were TCHAR types and functions that worked for both ANSI and UTF-16. Some think it was a mistake to try to code for both ways… which is what I think you’re saying.

The TEXT and TCHAR macros are less useful today, because all applications should use Unicode.

BTW the Windows 64 bit API is Unicode only.


C12 IDE is much better than C6 IDE. This includes less modality. Modality within a single app can be bypassed by creating a copy of the app. The window designer in C12 IDE has issues with coordinate recalculation. At one point, I considered bypassing this problem by using an alternative window designer. We copy the text representation of the window via the clipboard from the IDE to our internal application, edit it, and then copy it back to the IDE.

However, this approach was later abandoned in favor of adapting to the standard designer and dynamically aligning controls. The biggest problems in the C12 IDE are freezes and GPFs in a number of situations. The code generator may not work correctly from time to time. We adapt to this by frequently saving our work and restarting the IDE. Overall, I have already adapted and am quite comfortable working in the C12 IDE. It won’t get any better.

As for Unicode, in other development systems, the transition to Unicode was accompanied by a loss of backward compatibility. Windows NT and its subsequent versions operate in Unicode and re-encode ANSI to ensure backward compatibility. Similarly, MS approaches 64-bit operating systems by supporting win32 applications through wow64. MS’s overall approach to transitioning to 64-bit applications and Unicode has been consistent for many years. However, we face a dilemma. Follow MS and lose application backward compatibility, or stay on win32 and ansi and rely on Windows backward compatibility tools.

In summary, I think perhaps after reading Bruce’s comments one feels it might be a long wait for the next release of C12; skip thirteen and move to 14…

Luckily, 10 years ago we moved to a new concept on Linux: a pure binding machine in CPP where even what looks like a scripting language is just a binding…

Support for Unicode was a default and therefore a non-issue, I’m told… over the last 12 months I ported a private version of the Linux solution and now have it running some older Clarion hand-coded projects… giving new life to our Clarion Win32 projects…

Anyway, most Clarion developers using the power of the APP GEN and ABC won’t want to revert to hand code, not even for CapeSoft’s NetTalk…

One fabulous feature of AI document projects is that generating ABC templates could become pretty simply a case of the AI generating the source and the templates…

You don’t need Clarion 14 to do that; it’s here right now. In fact, new ABC paradigms could be created to power up your APP GENs…

It’s just that this hasn’t been tried much by the EXPERTS on Clarion LIVE… but I’m sure it will be soon, as project document size limits increase. They seem to be a bit stuck at the moment as the AI data centre build-out goes into orbit…

SV really needs to communicate a little better but this has been the case for a while.

They appear to have a monster job on their hands, and for some reason they are a bit stalled…

Hurrah! At last! Talk of UNICODE!
I may be unique in the Clarion world, but most of my apps are Hebrew (right-to-left) or at least include Hebrew; ask Sparky.
We were hoping that C12 would address that, but alas . . .

Sim

The irony is that Windows NT started life with Unicode first and foremost.

The original premise was to help others realize why Unicode is not yet supported, ie: reports.

In addition I did state several ideas having to do with Unicode support.

Bottom line is how you interpret this variable definition…

MyString STRING(80)

Until Unicode, you could read that as 80 bytes or 80 characters and everything would stay as-is. But with Unicode characters no longer being single-byte, this concept falls apart.

I’ve always read STRING(80) as being 80 characters but that no longer holds true if you’re still stuck in a byte = character world. Generally speaking every string based command within Clarion could be shifted to Unicode —IF— Unicode was what the RTL expected.

Even string slicing remains viable… just think CHARACTERS instead of bytes! The value, whether it’s a Unicode string or ANSI, remains valid.

If you absolutely need a “string” of bytes you have a viable option already available…

MyBytes BYTE,DIM(80), which can be OVER anything you want, or…

Add a new attribute for defined variables and files. Call it “ANSI” if you like. On a file structure it indicates the data is stored ANSI instead of Unicode. On a string variable it would indicate, to the RTL, that it’s a FIXED LENGTH string variable NOT to be handled internally as Unicode, ie: a STRING(80),ANSI would, indeed, remain as 80 bytes.

No one ever said that rewrites wouldn’t be necessary, but why make them impossible to conceive, such as “how many Unicode chrs fill a STRING(80)?”

Someone referenced TCHARs, so I’ll ask: how many bytes are they, EACH???

For files using ANSI, the RTL knows to expand that to Unicode when read and return it to ANSI when saved. All your existing files remain viable and you can move to Unicode data in the future if needed.

For anyone not liking that approach, use an attribute to indicate Unicode for a file or variable, although, personally, I’d go with the ANSI attribute.

Now I have a headache so I’ll step to the side.

MyString            STRING(80),ANSI    !fixed, non-expandable so no Unicode
MyBytes             BYTE,DIM(80),OVER(MyString)

If the _UNICODE flag is False then a TCHAR is a BYTE 8 bit Char and the ANSI functions are called.

If the _UNICODE flag is True then a TCHAR is a USHORT 16 bit WChar UTF-16 and the W Unicode functions are called.

So a TSTRING(80) would be 80 WCHARs, so 160 bytes. That would not be enough space once characters outside the basic plane are involved (some CJK ideographs, emoji), since those take two WCHARs (4 bytes) per “symbol”.

This was a mechanism MS created so the same Windows C code could compile both ways, to ease the transition to Unicode.
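
Roughly what that mechanism looks like in use (standard <tchar.h>, nothing Clarion-specific); a minimal sketch:

    #include <windows.h>
    #include <tchar.h>

    void TcharDemo(void)
    {
        TCHAR msg[80] = _T("Hello");   /* 80 bytes in an ANSI build, 160 bytes when _UNICODE is defined */
        size_t n = _tcslen(msg);       /* expands to strlen() or wcslen()                               */
        MessageBox(NULL, msg, _T("TCHAR demo"), MB_OK);   /* expands to MessageBoxA or MessageBoxW      */
        (void)n;
    }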

I think the SV “unified STRING type” seems to be somewhat like that. We have no info beyond a short description. I would guess each STRING will have a _UNICODE flag set to true by the RTL at assignment of a Wide String, which will cause the RTL to call the wide functions.

Bruce wants to talk about Unicode from a wider point of view that covers everything. I think we can take a more narrow focus of Windows Unicode that is primarily UTF-16 “Wide Strings” that call the W API. Once we have that nailed down we can pursue more.

Lee, I do NOT see what you propose working for me; too many code changes or too much breakage. Maybe as a flag that can be turned on to try it.

Can you expand on that?

What I envision is a workable solution that doesn’t require a rewrite for Unicode. If you define a STRING(20) then that variable SHOULD be aware that it’s to hold Unicode UNLESS you add an attribute to indicate it’s a fixed length and therefore NOT Unicode; ie: exactly like things are today, Ansi.

Remember, I’m not talking bytes, I’m talking characters.

The core of Windows does this now. For most “string”-based API calls, when you call SomethingA the string is internally converted to Unicode, the function actually called is SomethingW, and once it completes the result is converted back to ANSI. In fact, if I’m not mistaken, MS suggests that all API calls be the wide (Unicode) versions instead of ANSI since that processes faster.
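
A sketch of the shape of that pattern, with invented function names; this is not Microsoft’s actual source, just the convert-then-delegate idea:

    #include <windows.h>

    /* the "real" worker: wide all the way down */
    static BOOL MySetTitleW(HWND wnd, const wchar_t *title)
    {
        return SetWindowTextW(wnd, title);
    }

    /* thin ANSI wrapper: convert, delegate, done */
    static BOOL MySetTitleA(HWND wnd, const char *title)
    {
        wchar_t wide[256];   /* fixed buffer just to keep the sketch short */
        MultiByteToWideChar(CP_ACP, 0, title, -1, wide, 256);
        return MySetTitleW(wnd, wide);
    }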

How would making the Clarion RTL Unicode-based throughout make things more complicated? I agree it would make SV’s job more difficult, but it shouldn’t make OUR jobs more difficult.

Some developers would need to rethink a STRING() as what it is: the number of characters that a variable will hold, not the byte count. Put the difficult, boring crap in the RTL. It’s supposed to do that for you, correct?! :thinking:

As I asked, what code changes would be necessary for what commands? Assuming you can rethink in the character count instead of the byte count?

And as to a flag, that’s what I suggested YEARS ago. Make an APP W or A, period, but even then you would need to make subtle changes where you’re using a STRING() as an easy way to manipulate BYTES in a string. Most commands in Clarion can be “re-seen” as character counts instead of byte counts. Even string slicing, [this is the chr count] not [this is the byte count]. With Unicode these numbers would be equal in the command.

It’s an afternoon and I have laundry to wash!!! :wink:

If a STRING(20) were guaranteed to hold 20 characters, it would need to be allocated 240 bytes.

Since all global threaded variables are allocated on the construction of a thread, and since all FILE structures are presumably THREADed, you can expect more or less an order of magnitude more RAM to be allocated per thread.

Performance of all string handling functions would be poor.

The addition of an attribute to the declaration is equivalent to a different data type. Except that,

a) an attribute is opt-out, not opt-in, meaning code would be broken by default until fixed. A different type means opt-in, which in turn means code works out of the box but extends functionality as desired.

b) Clarion allows for passing-by-reference. This makes it possible to have 2 functions, one taking a *STRING, and another taking a *USTRING. If you used an attribute this would not be possible.

c) From a language (syntax) point of view, a new type or an attribute are the same, other than the syntax. Up to now Clarion has declared different Types for different characteristics. We have a STRING and a CSTRING, not STRING and STRING,C. So adding an attribute would be inconsistent.

Incidentally, in other (compiled) languages the concept of character count is somewhat moot. If you want to provision based on character length you need dynamic strings, at which point there’s no limit to the length. If you want to think in units, think code points, not characters.

Of course this whole approach of just treating everything as a wide string steps past a lot of other Unicode realities which are important. For example, strings (especially input or imported strings) need to be normalised or filtering and sorting will break. To keep existing programs working, opt-in, not opt-out, will be necessary.
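
A minimal sketch of why normalisation matters, using the Win32 NormalizeString call (Winnls.h, link against normaliz.lib); buffer sizes are just illustrative. “é” stored as one code point and “e” plus a combining accent compare as different until both are normalised to the same form.

    #include <windows.h>
    #include <wchar.h>

    /* returns 1 when two strings are the same text once both are in NFC */
    int SameAfterNfc(const wchar_t *a, const wchar_t *b)
    {
        wchar_t na[64], nb[64];
        NormalizeString(NormalizationC, a, -1, na, 64);
        NormalizeString(NormalizationC, b, -1, nb, 64);
        return wcscmp(na, nb) == 0;
    }

    /* SameAfterNfc(L"\u00E9", L"e\u0301") -> 1, while wcscmp on the raw strings says they differ */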

To be honest I’m really not sure what problem you are trying to solve. For reports, converting ANSI to UTF-16 is trivial. An EMF report engine should have no problems at all with an ANSI string (converting as needed). All it needs to do is reflect the data type.

All I can suggest is that you consider NCHAR and NVARCHAR, which are defined by character count independent of the bytes required and were specifically created for Unicode.

Suppose a client says they need an entry for 40 characters. How would YOU define that variable so it holds up to 40 Unicode characters?

I’m waiting for someone to give me a string processing command in Clarion that would stumble if the RTL was counting characters, not bytes. Makes sense with Ansi since it’s 1:1.

Basically all I’m saying is if the RTL continues to conceive of bytes instead of characters then moving to Unicode is going to be a nightmare.

What I’m requesting feedback on is the concept that would allow Unicode and Ansi to coexist in the same APP with VERY little editing.

The original point, which I’ve already made, is that until Clarion switches to an EMF-based report image, Unicode cannot be used, unless your end user doesn’t want any native reports.

As to Unicode in EMFs, the definition provided by MS indicates that all TextOut should be, yes, SHOULD BE, in Unicode. Makes sense, since you wouldn’t have to guess whether the string was ANSI or Unicode. Possibly the same reason Windows is Unicode and the reason MS recommends no longer calling the FunctionA APIs.

And wasn’t the USTRING concept dropped???

I’m not all that interested in Unicode, especially since I’m working toward my “Going Out of Business” sale for next year. If Clarion releases EMF’s before I close up I’ll add the support required in RPM, CRT and AFE. If it’s after the sale I’ll consider writing a universal fix that can be used by anyone for such things as Page of Pages and text searches.

And that right there is the problem. Clarion has forever had character = byte. Strings at the moment are also used for a lot of binary data that would break if your scheme were adopted. Yes, a lot of functions could be made to work, but things like string slicing would break, because it depends on char = byte.
I have code that parses binary data, and I depend on the STRING holding so many bytes. My code would break under your scheme; I’d have to rewrite whole portions to get the C=B relationship back again.

One of the really, really big things with Clarion has been backward compatibility. I’d be happy if there were a USTRING so, as Bruce suggested, I could opt in, but I’ll kick up a stink if I’ve got to rewrite a shit ton of code.

Python did this. They went from V2 with C=B to V3 = Unicode and it took 10 years to be accepted due to wholesale breakages.

Retiring or GOOB?

I think Gus @ CHT and @noyantis had/have the right idea, charging an annual subscription.

It’s what I’m planning on doing, because it makes less business sense for a customer to benefit from the continued use of a prior release, plus it also then demands new features in the next paid-for release, which not everyone will like or want.

Back in the 90s and 00s ANSI made sense because the world wasn’t very globalised.
Today, with immigration and cheap air travel, Unicode/multi-byte makes sense.

My only concern is how much it will slow down DB indexes.

Looking at the Windows implementation of Unicode, it seems to guess what the character mapping should be. See WC_NO_BEST_FIT_CHARS:

For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for “∞” (infinity) maps to 8 (eight) in some code pages.
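
A small sketch of the flag in use, converting UTF-16 back to the ANSI code page and detecting characters that could not be represented rather than silently accepting a look-alike; the helper name is made up.

    #include <windows.h>

    /* returns 0 if every character survived the trip to ANSI, non-zero otherwise */
    int ToAnsiStrict(const wchar_t *src, char *dst, int dstBytes)
    {
        BOOL lost = FALSE;
        WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                            src, -1, dst, dstBytes, NULL, &lost);
        return lost ? 1 : 0;
    }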