Unicode UTF-16 Surrogate Pairs Example

Based on my Hub Post and RZ Reply I created a Clarion 11/12 example showing Unicode Code Points above the 16 bit BMP (> FFFFh) that require Two UTF-16 code units. These are called Surrogate Pairs and are in special ranges D800h-DBFF and DC00 TO DFFF. I find these ugly in code, worse they hide the original code point.

E.g. I want the Emoji :poop: “Pile of Poo” U+1F4A9 coded as Button(‘<1F489h>’) that is obvious instead of Button(‘<0D83Dh,0DCA9h>’) with a Surrogate Pair that hides the common single U+ code.

I created the UCode(Long) function that can take a 32-bit Unicode Code Point like 1F489h and make the 2 UTF-16 Surrogates. Here is some code that shows Emoji’s and other Unicode in a WinAPI MessageBoxW.

The Poo emoji is coded both as 1 code 1F4A9h and below a Surrogate Pair D33D,DCA9. The word Cafe’ is done two ways Precomposed & Combining that will be discussed below, it is not a Surrogate. The Euro Sign is in the BMP so is simply 20ACh.

Next a Hex Dump of the UTF-16 shows in a Message and its on the Clipboard. The Surrogate Pairs have a Plus Sign (+) between them with the 2nd lowercase to make them standout. Also a Big Endian version shows which is easier to understand. You can see the Unicode for Poo 1 and 2 have the same in UTF-16.



The Message Box font is small so very poor at showing Emoji’s. You can press Ctrl+C to copy the message, then paste it into other programs like Word, Email or here for a bigger better rendering:

€	U+020AC Euro
😃	U+1F603 Smile
💩	U+1F4A9 Poo 1
💩	D83D,DCA9 Poo 2
�	U+20FFFF Invalid
é	U+00E9 Café
é	U+0301 Café
2 Likes

Project Download

SurrogateTest_v1_20260118_112802.ZIP (4.6 KB)

1 Like

Main Code. The UTF-16 is done with a STRING type. When we get the Unicode C12 it should convert to USTRING fairly easily.

  MAP
SurrogateTest   PROCEDURE()
IsSurrogate     PROCEDURE(STRING Utf16WideChar),BYTE  !1=High (1st); 2=Low (2nd); 0=Not Surrogate
UCode           PROCEDURE(LONG inUtf32),STRING          !UTF-32 Code Point returns Surrogate Pair, unless UTF-16 works
WChr            PROCEDURE(USHORT Utf16CodeUnit),STRING  !Wide Character in Little Endian String for UTF-16 Code Unit
WChrs     PROCEDURE(USHORT Unit1, USHORT Unit2),STRING  !2 Wide Character Code Units e.g. Surrogate Pair High,Low or Char,Combining Mark
WVal            PROCEDURE(STRING Utf16WideChar),USHORT  !Return UShort of Wide Character in Little Endian String UTF-16 Code Unit
WAscii          PROCEDURE(STRING inAscii),STRING        !Convert ASCII to Wide by simple add <0> no Code Page
HexDump         PROCEDURE(STRING inUtf16, BOOL BigEnd=0),STRING  !Utf 16 in Hex formated for Debug Message() to verify Surrogates
HexS1           PROCEDURE(STRING InStr1),STRING         !HEX digits of 1 byte String
    Module('Win32')
MsgBoxAnsi  PROCEDURE(LONG hWnd=0,*CSTRING lpText, <*CSTRING lpCaption>, LONG uType=0),PASCAL,RAW,LONG,PROC,DLL(1),NAME('MessageBoxA')
MsgBoxWide  PROCEDURE(LONG hWnd=0,*STRING lpText_Strg, LONG lpCaption_LPCWSTR, LONG uType=0),PASCAL,RAW,LONG,PROC,DLL(1),NAME('MessageBoxW')
    END
MsgBoxAPI   PROCEDURE(STRING sText, STRING sCaption)       !RTL Message stopped working?
  END
wNull   EQUATE('<0,0>')         !Wide Null
wLf     EQUATE('<10,0>')        !Wide 10 is all the API MessageBox needs
wTab    EQUATE('<9,0>')         !Wide Tab
  CODE
  SurrogateTest()

SurrogateTest   PROCEDURE()
CapBStr     BSTRING                 !Wide String we have
CapWStr     LONG,OVER(CapBStr)      !Points to the WString inside the BString group of Address & Size
MsgWStr     STRING(8000)            !Change to WSTRING
    CODE
    CapBStr = 'Surrorgate Test UCode() UTF-32'                              !https://en.wikipedia.org/wiki/List_of_emojis
    MsgWStr = UCode(20ACh)  & wTab & WAscii('U+020AC Euro')     &|          !UCode() handles UTF-16 fine
        wLf & UCode(1F603h) & wTab & WAscii('U+1F603 Smile')    &|          !Smiling Face with Open Mouth
        wLf & UCode(1F4A9h) & wTab & WAscii('U+1F4A9 Poo 1')    &|          !Pile of Poo 1 UTF-32
   wLf& WChrs(0D83Dh,0DCA9h) &wTab & WAscii('D83D,DCA9 Poo 2')  &|          !Pile of Poo 2 Surrogates
        wLf & UCode(20FFFFh)& wTab & WAscii('U+20FFFF Invalid') &|          !show <?> > valid range
        wLf & UCode(0E9h)   & wTab & WAscii('U+00E9 Caf') & UCode(0E9h)  &| !U+00E9 LETTER e WITH ACUTE ACCENT  https://en.wikipedia.org/wiki/Precomposed_character
       wLf& WChrs(65h,301h) & wTab & WAscii('U+0301 Cafe')& UCode(301h)  &| !e + U+0301 COMBINING ACUTE ACCENT  https://en.wikipedia.org/wiki/Combining_character
              wNull
    MsgBoxWide(,MsgWStr,CapWStr)                                            !Show Unicode with Windows Message Box W

    MsgBoxApi('Little Endian:|={30}|'& HexDump(MsgWStr) & |                 !Debug to see Hex
            '||Big Endian:|={30}|' & HexDump(MsgWStr,1) & |
            '||Surrogate Pairs are High D800-DBFF, and Low DC00-DFFF|Windows Little Endian these bytes are flipped 00DXh ' & |
            '','Hex Dump UCode() UTF 16 & 32')
    RETURN

I created these functions to handle UTF-16. A common need is to change from from Big Endian used in number types e.g. UShort(20ACh) to Little Endian used in Strings STRING('<0ACh,20h>'). This canbe done most simply with an OVER().

UCode(Long) takes a UTF 32 and returns the Surrogate Pair. The code is pretty simple splitting it into two 10 bit parts.

UCode PROCEDURE(LONG inUtf32)     !UTF-32 Code Point returns Surrogate Pair, unless UTF-16 works
Utf16Str STRING(2),OVER(inUtf32)  !Little Endian USHORT UTF-16 of ULong's first 16 bits
U20 LONG,AUTO                     !Max Unicode is 10FFFF not close to 7FFF FFF max Long
W1  USHORT,AUTO
W2  USHORT,AUTO
    CODE
    CASE inUtf32
    OF   0001h TO  0FFFFh   !UTF-16 Range so just UShort 2 bytes
        RETURN Utf16Str
    OF 010000h TO 10FFFFh   !Range of Characters with Surrogate Pairs, process below End Case
    ELSE
        RETURN WChr(0FFFDh) !Invalid = U+FFFD <?> https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
    END !Case
    U20 = inUtf32 - 10000h          ! 0x10000 is subtracted leaving a 20-bit number (U') in the range 0x00000–0xFFFFF
    W1  = 0D800h + BSHIFT(U20,-10)  ! High 10 bits add to 0xD800 = 1st 16-bit code unit or High Surrogate (W1), range 0xD800–0xDBFF
    W2  = 0DC00h + BAND(U20,3FFh)   ! Low 10 bits  add to 0xDC00 = 2nd 16-bit code unit or Low Surrogate  (W2), range 0xDC00–0xDFFF
    RETURN WChr(w1) & WChr(W2)      ! Return High & Low Surrogate
!------------------------------------------------------------------------------
WChr PROCEDURE(USHORT Utf16CodeUnit)        !Wide Character in Little Endian String for UTF-16 Code Unit
WStrUtf16LE STRING(2),OVER(Utf16CodeUnit)
    CODE
    RETURN WStrUtf16LE                      !Clarion 12 Unicode can use CHR(#, true) for UTF 16
WChrs PROCEDURE(USHORT Unit1, USHORT Unit2) !2 Wide Character Code Units e.g. Surrogate Pair High+Low
    CODE                                    !Todo allow up to 21 <Unit#>
    RETURN WChr(Unit1) & WChr(Unit2)
!------------------------------------------------------------------------------
WVal PROCEDURE(STRING Utf16WideChar)        !,USHORT  !Return UShort of Wide Character in Little Endian String UTF-16 Code Unit
ValUtf16 USHORT,OVER(Utf16WideChar)         !Better pass in STRING(2) or ????
    CODE
    RETURN ValUtf16                         !Clarion 12 Unicode can use VAL(Char, true) for UTF 16
!------------------------------------------------------------------------------
IsSurrogate PROCEDURE(STRING Utf16WideChar) !,BYTE  !1=High (1st); 2=Low (2nd); 0=Not Surrogate
ValUtf16 USHORT,OVER(Utf16WideChar)         !Better pass in STRING(2) or ????
HiLoNo BYTE
    CODE
    CASE ValUtf16
    OF 0D800h TO 0DBFFh ; HiLoNo=1  !High 1st
    OF 0DC00h TO 0DFFFh ; HiLoNo=2  !Low  2nd
    END
    RETURN HiLoNo                           !Clarion 12 Unicode needs Surrogate functions Is/To/From
!------------------------------------------------------------------------------
WAscii PROCEDURE(STRING inAscii)            !Convert ASCII to Wide by simple add <0> HACK no Code Page consideration
W   STRING(SIZE(inAscii)*2),AUTO            !W = Wide ASCII to Return
i   USHORT,AUTO
j   USHORT
    CODE
    LOOP i=1 TO SIZE(inAscii)
        W[j+1]=inAscii[i]           !Little Endian ASCII Val in Byte 1
        W[j+2]=CHR(0)               !                 + Zero as Byte 2
        j+=2
    END
    RETURN W                                !Clarion 12 assign Unicode = ANSI does this, plus handles Code Page
!------------------------------------------------------------------------------
HexDump PROCEDURE(STRING inUtf, BOOL BigEnd=0)  !Utf 16 in Hex formated for Message() to Debug and verify code, to see Surrogates
Dmp   ANY
X     LONG
Wide2 STRING(2)
HHHHs STRING(5)
    CODE
    LOOP X=1 TO SIZE(inUTF) by 2
        Wide2 = InUTF[X : X+1]
        CASE Wide2
        OF '<0,0>'  ; BREAK                     !Null is the End, don't show 0000
        OF '<10,0>' ; Dmp=Dmp & '|' ; CYCLE     !No need to see 0A00 for Debug
        END
        HHHHs=CHOOSE(~BigEnd, HexS1(Wide2[1])&HexS1(Wide2[2]) , HexS1(Wide2[2])&HexS1(Wide2[1]))
        CASE IsSurrogate(Wide2)
        OF 1 ; HHHHs[5] = '+'       !Surrogate 1 aka High with +Low instead of Space
        OF 2 ; HHHHs=lower(HHHHs)   !Surrogate 2 aka Low in lower case
        END
        Dmp = Dmp & HHHHs
    END
    RETURN Dmp
HexS1 PROCEDURE(STRING InStr1) !,STRING     !HEX digits of 1 byte String
In BYTE,OVER(InStr1)
Hx STRING('0123456789ABCDEF')
  CODE
  RETURN Hx[BSHIFT(in, -4) + 1] & Hx[BAND(in, 0FH) + 1]
!------------------------------------------------------------------------------
MsgBoxAPI  PROCEDURE(STRING sText, STRING sCaption) !RTL Message() stopped showing all the content
cText       CSTRING(SIZE(sText)+2)
cCaption    CSTRING(SIZE(sCaption)+120)
xPipe       LONG
    CODE
    cText=CLIP(sText)
    cCaption = CLIP(sCaption) &'...(Text on Clipboard or Ctrl+C) '& All('.')
    LOOP
        xPipe=INSTRING('|',cText) ; IF ~xPipe THEN BREAK.
        cText[xPipe]=CHR(10)
    END
    ! SETCLIPBOARD(sText) ; Message   (sText,cCaption)       !Clarion, but original sText so has |||
    SETCLIPBOARD(cText)             !PAste this in an Editor for better screen capture
    MsgBoxAnsi(,cText,cCaption)
    RETURN

The Clarion RTL should have some of these. We should have functions for Surrogates. As I mentioned in my post to Z and above that I would like Source Code to allow UTF-32 code points like it is allowed in C# VB.net C++.

2 Likes

Also in the example is the borrowed French word Cafe` with an Acute Accent Diacritic Mark done 2 ways:

First with one code unit U+00E9 as a Precomposed Character named an “e with Acute Accent”.

Second with two code units a normal “e” 65h plus U+301 the Acute Accent Combining Character. It takes two code units, but renders as the same single “e`”. This is not a Surrogate Pair. There can be more than 2 Combining Characters, some say the limit is 32. Tibetan can use up to 8.

Precomposed versus Decomposed characters is a complicated subject so I’ll leave it to you to read the links above, plus the topic of Normalization.

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they were included for compatibility with early encoding systems such as the various components of ISO 8859 or other kinds of “extended ASCII”. More recent Unicode policy has been to resist creation of new precomposed characters if the character can be produced using combining forms.

This webpage has a tool that lets you add and remove accents from characters. It’ll also let you copy the different normalization forms or the HTML entity encoded form of text with accents or other special characters.

1 Like