Main Code. The UTF-16 is done with a STRING type. When we get the Unicode C12 it should convert to USTRING fairly easily.
MAP
SurrogateTest PROCEDURE()
IsSurrogate PROCEDURE(STRING Utf16WideChar),BYTE !1=High (1st); 2=Low (2nd); 0=Not Surrogate
UCode PROCEDURE(LONG inUtf32),STRING !UTF-32 Code Point returns Surrogate Pair, unless UTF-16 works
WChr PROCEDURE(USHORT Utf16CodeUnit),STRING !Wide Character in Little Endian String for UTF-16 Code Unit
WChrs PROCEDURE(USHORT Unit1, USHORT Unit2),STRING !2 Wide Character Code Units e.g. Surrogate Pair High,Low or Char,Combining Mark
WVal PROCEDURE(STRING Utf16WideChar),USHORT !Return UShort of Wide Character in Little Endian String UTF-16 Code Unit
WAscii PROCEDURE(STRING inAscii),STRING !Convert ASCII to Wide by simple add <0> no Code Page
HexDump PROCEDURE(STRING inUtf16, BOOL BigEnd=0),STRING !Utf 16 in Hex formated for Debug Message() to verify Surrogates
HexS1 PROCEDURE(STRING InStr1),STRING !HEX digits of 1 byte String
Module('Win32')
MsgBoxAnsi PROCEDURE(LONG hWnd=0,*CSTRING lpText, <*CSTRING lpCaption>, LONG uType=0),PASCAL,RAW,LONG,PROC,DLL(1),NAME('MessageBoxA')
MsgBoxWide PROCEDURE(LONG hWnd=0,*STRING lpText_Strg, LONG lpCaption_LPCWSTR, LONG uType=0),PASCAL,RAW,LONG,PROC,DLL(1),NAME('MessageBoxW')
END
MsgBoxAPI PROCEDURE(STRING sText, STRING sCaption) !RTL Message stopped working?
END
wNull EQUATE('<0,0>') !Wide Null
wLf EQUATE('<10,0>') !Wide 10 is all the API MessageBox needs
wTab EQUATE('<9,0>') !Wide Tab
CODE
SurrogateTest()
SurrogateTest PROCEDURE()
CapBStr BSTRING !Wide String we have
CapWStr LONG,OVER(CapBStr) !Points to the WString inside the BString group of Address & Size
MsgWStr STRING(8000) !Change to WSTRING
CODE
CapBStr = 'Surrorgate Test UCode() UTF-32' !https://en.wikipedia.org/wiki/List_of_emojis
MsgWStr = UCode(20ACh) & wTab & WAscii('U+020AC Euro') &| !UCode() handles UTF-16 fine
wLf & UCode(1F603h) & wTab & WAscii('U+1F603 Smile') &| !Smiling Face with Open Mouth
wLf & UCode(1F4A9h) & wTab & WAscii('U+1F4A9 Poo 1') &| !Pile of Poo 1 UTF-32
wLf& WChrs(0D83Dh,0DCA9h) &wTab & WAscii('D83D,DCA9 Poo 2') &| !Pile of Poo 2 Surrogates
wLf & UCode(20FFFFh)& wTab & WAscii('U+20FFFF Invalid') &| !show <?> > valid range
wLf & UCode(0E9h) & wTab & WAscii('U+00E9 Caf') & UCode(0E9h) &| !U+00E9 LETTER e WITH ACUTE ACCENT https://en.wikipedia.org/wiki/Precomposed_character
wLf& WChrs(65h,301h) & wTab & WAscii('U+0301 Cafe')& UCode(301h) &| !e + U+0301 COMBINING ACUTE ACCENT https://en.wikipedia.org/wiki/Combining_character
wNull
MsgBoxWide(,MsgWStr,CapWStr) !Show Unicode with Windows Message Box W
MsgBoxApi('Little Endian:|={30}|'& HexDump(MsgWStr) & | !Debug to see Hex
'||Big Endian:|={30}|' & HexDump(MsgWStr,1) & |
'||Surrogate Pairs are High D800-DBFF, and Low DC00-DFFF|Windows Little Endian these bytes are flipped 00DXh ' & |
'','Hex Dump UCode() UTF 16 & 32')
RETURN
I created these functions to handle UTF-16. A common need is to change from from Big Endian used in number types e.g. UShort(20ACh) to Little Endian used in Strings STRING('<0ACh,20h>'). This canbe done most simply with an OVER().
UCode(Long) takes a UTF 32 and returns the Surrogate Pair. The code is pretty simple splitting it into two 10 bit parts.
UCode PROCEDURE(LONG inUtf32) !UTF-32 Code Point returns Surrogate Pair, unless UTF-16 works
Utf16Str STRING(2),OVER(inUtf32) !Little Endian USHORT UTF-16 of ULong's first 16 bits
U20 LONG,AUTO !Max Unicode is 10FFFF not close to 7FFF FFF max Long
W1 USHORT,AUTO
W2 USHORT,AUTO
CODE
CASE inUtf32
OF 0001h TO 0FFFFh !UTF-16 Range so just UShort 2 bytes
RETURN Utf16Str
OF 010000h TO 10FFFFh !Range of Characters with Surrogate Pairs, process below End Case
ELSE
RETURN WChr(0FFFDh) !Invalid = U+FFFD <?> https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
END !Case
U20 = inUtf32 - 10000h ! 0x10000 is subtracted leaving a 20-bit number (U') in the range 0x00000–0xFFFFF
W1 = 0D800h + BSHIFT(U20,-10) ! High 10 bits add to 0xD800 = 1st 16-bit code unit or High Surrogate (W1), range 0xD800–0xDBFF
W2 = 0DC00h + BAND(U20,3FFh) ! Low 10 bits add to 0xDC00 = 2nd 16-bit code unit or Low Surrogate (W2), range 0xDC00–0xDFFF
RETURN WChr(w1) & WChr(W2) ! Return High & Low Surrogate
!------------------------------------------------------------------------------
WChr PROCEDURE(USHORT Utf16CodeUnit) !Wide Character in Little Endian String for UTF-16 Code Unit
WStrUtf16LE STRING(2),OVER(Utf16CodeUnit)
CODE
RETURN WStrUtf16LE !Clarion 12 Unicode can use CHR(#, true) for UTF 16
WChrs PROCEDURE(USHORT Unit1, USHORT Unit2) !2 Wide Character Code Units e.g. Surrogate Pair High+Low
CODE !Todo allow up to 21 <Unit#>
RETURN WChr(Unit1) & WChr(Unit2)
!------------------------------------------------------------------------------
WVal PROCEDURE(STRING Utf16WideChar) !,USHORT !Return UShort of Wide Character in Little Endian String UTF-16 Code Unit
ValUtf16 USHORT,OVER(Utf16WideChar) !Better pass in STRING(2) or ????
CODE
RETURN ValUtf16 !Clarion 12 Unicode can use VAL(Char, true) for UTF 16
!------------------------------------------------------------------------------
IsSurrogate PROCEDURE(STRING Utf16WideChar) !,BYTE !1=High (1st); 2=Low (2nd); 0=Not Surrogate
ValUtf16 USHORT,OVER(Utf16WideChar) !Better pass in STRING(2) or ????
HiLoNo BYTE
CODE
CASE ValUtf16
OF 0D800h TO 0DBFFh ; HiLoNo=1 !High 1st
OF 0DC00h TO 0DFFFh ; HiLoNo=2 !Low 2nd
END
RETURN HiLoNo !Clarion 12 Unicode needs Surrogate functions Is/To/From
!------------------------------------------------------------------------------
WAscii PROCEDURE(STRING inAscii) !Convert ASCII to Wide by simple add <0> HACK no Code Page consideration
W STRING(SIZE(inAscii)*2),AUTO !W = Wide ASCII to Return
i USHORT,AUTO
j USHORT
CODE
LOOP i=1 TO SIZE(inAscii)
W[j+1]=inAscii[i] !Little Endian ASCII Val in Byte 1
W[j+2]=CHR(0) ! + Zero as Byte 2
j+=2
END
RETURN W !Clarion 12 assign Unicode = ANSI does this, plus handles Code Page
!------------------------------------------------------------------------------
HexDump PROCEDURE(STRING inUtf, BOOL BigEnd=0) !Utf 16 in Hex formated for Message() to Debug and verify code, to see Surrogates
Dmp ANY
X LONG
Wide2 STRING(2)
HHHHs STRING(5)
CODE
LOOP X=1 TO SIZE(inUTF) by 2
Wide2 = InUTF[X : X+1]
CASE Wide2
OF '<0,0>' ; BREAK !Null is the End, don't show 0000
OF '<10,0>' ; Dmp=Dmp & '|' ; CYCLE !No need to see 0A00 for Debug
END
HHHHs=CHOOSE(~BigEnd, HexS1(Wide2[1])&HexS1(Wide2[2]) , HexS1(Wide2[2])&HexS1(Wide2[1]))
CASE IsSurrogate(Wide2)
OF 1 ; HHHHs[5] = '+' !Surrogate 1 aka High with +Low instead of Space
OF 2 ; HHHHs=lower(HHHHs) !Surrogate 2 aka Low in lower case
END
Dmp = Dmp & HHHHs
END
RETURN Dmp
HexS1 PROCEDURE(STRING InStr1) !,STRING !HEX digits of 1 byte String
In BYTE,OVER(InStr1)
Hx STRING('0123456789ABCDEF')
CODE
RETURN Hx[BSHIFT(in, -4) + 1] & Hx[BAND(in, 0FH) + 1]
!------------------------------------------------------------------------------
MsgBoxAPI PROCEDURE(STRING sText, STRING sCaption) !RTL Message() stopped showing all the content
cText CSTRING(SIZE(sText)+2)
cCaption CSTRING(SIZE(sCaption)+120)
xPipe LONG
CODE
cText=CLIP(sText)
cCaption = CLIP(sCaption) &'...(Text on Clipboard or Ctrl+C) '& All('.')
LOOP
xPipe=INSTRING('|',cText) ; IF ~xPipe THEN BREAK.
cText[xPipe]=CHR(10)
END
! SETCLIPBOARD(sText) ; Message (sText,cCaption) !Clarion, but original sText so has |||
SETCLIPBOARD(cText) !PAste this in an Editor for better screen capture
MsgBoxAnsi(,cText,cCaption)
RETURN
The Clarion RTL should have some of these. We should have functions for Surrogates. As I mentioned in my post to Z and above that I would like Source Code to allow UTF-32 code points like it is allowed in C# VB.net C++.