New Line in StrPos - is it ascii linefeed <10> or Carriage Return <13>?

RchdR · October 30, 2024, 4:07pm

Well, it's none of these:
<10>abc
<13>abc
<13,10>abc
I tried each as a search string with a regex using just the period, and Strpos returns 1 every time, not 2, 2, 3 for the respective search strings.

Loc:searchstring string(100)

Loc:searchstring is set to each of these in turn:
'abc<10>xyz'
'abc<13>xyz'
'abc<13,10>xyz'

Respective regex patterns:
'abc.xyz'
'abc.xyz'
'abc…xyz'
Strpos always returns 1.

So any ideas what this newline char or string sequence is?

Edit
Using the same search strings as above with the regex pattern [^abc], strpos returns 4 for each search string.

It doesn't look like there is a newline character (or characters) as per the docs…

Bruce · October 31, 2024, 5:50am

Unicode is a mapping, not an encoding. Unicode doesn't use any number of bytes; it's just a mapping of numbers to symbols.

For use in computers there are multiple different encodings that can be used. When you talk about how many bytes per code point you are talking about the encoding, not unicode.

Regardless of the encoding (UTF-8, UTF-16, UTF-32, UCS-2) there is no ‘fixed number of bytes per character’, since characters can be multiple code points long.

UCS-2 has 2 bytes per code point, but cannot represent all the Unicode mappings, so is obsolete.

UTF-8 and UTF-16 are variable length: UTF-8 uses 1 to 4 bytes per code point and UTF-16 uses 2 or 4. UTF-32 uses 4 bytes per code point.
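To make the code point versus byte distinction concrete, here is a small sketch. It is in Python rather than Clarion purely because Python's built-in codecs make the sizes easy to check; the sample characters and the helper name encoded_sizes are just illustrative choices.

```python
# Bytes needed per code point in the common Unicode encodings.
samples = {
    "A": "A",                # U+0041, plain ASCII
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001f600",   # U+1F600, outside the Basic Multilingual Plane
}

def encoded_sizes(text):
    """Return (utf-8, utf-16, utf-32) byte counts for text, without a BOM."""
    return tuple(len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, ch in samples.items():
    print(name, encoded_sizes(ch))

# One visible character can be more than one code point:
combined = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(combined), encoded_sizes(combined))
```

The emoji needs 4 bytes in all three encodings, and the combined "e" plus accent is two code points even though it displays as one character, which is why "bytes per character" is not a fixed number in any of them.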

RchdR · October 31, 2024, 11:22am

Lost already…

I’ll cross the unicode bridge when I get to it…

RchdR · October 31, 2024, 11:37am

@vitesse this code has the same problem as strpos, namely it's clipping the trailing spaces from the string.

Try this:
Loc:ptest string('2<32><32>')
Loc:pregex string('.<32><32>$')

The regex will match any char followed by two spaces, and the last space needs to be the last char in the search string.

Strpos returns 0 when it should return 1.

Loc:ptest string('<32>2<32><32>')
Loc:pregex string('.<32><32>$')

Here strpos should return 2 but it returns 0 which is wrong.
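For comparison, here is a rough sketch of the two tests in Python. Python's re module is not Clarion's regex engine, so this only shows what the expected positions are and that removing trailing spaces from the search text is enough to turn both results into 0. The helper names pos_1based and pos_clipped are made up for the example.

```python
import re

def pos_1based(text, pattern):
    """1-based start of the first regex match, or 0 if none (a strpos-style result)."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

def pos_clipped(text, pattern):
    """Same search, but with trailing spaces removed from the text first."""
    return pos_1based(text.rstrip(" "), pattern)

pattern = ".  $"            # any char, two spaces, end of string
for text in ["2  ", " 2  "]:
    print(repr(text), pos_1based(text, pattern), pos_clipped(text, pattern))
```

The unclipped search gives 1 and 2; the clipped search gives 0 both times, which matches the behaviour described above.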

Sidenote: if your code is on GitHub, it's poisoning the LLMs, because they can't spot bugs in programming languages.

This is also a good example of why built-in unit tests are good: they can spot problems, and then either throw an error or run additional code to mitigate the bug. I don't see LLMs handling bugs in programming languages, but I'm now also wondering if I could use strpos clipping trailing spaces to inject some code, or use the regex to do something similar. Most coders will shy away from regexes in any code they have to work with, proof being that this bug has existed for at least 20 years in Clarion, but it provides cover for some potential hacks, if I can make them work. If I can, then I would also wonder whether this bug exists in other languages, and for how long…

vitesse · October 31, 2024, 11:36pm

You were right, Richard. I have changed the code above to delay getting the start position until AFTER the space substitution has been done, due to the clipping in strpos that you pointed out. So it passes your two tests now. Let me know if you have any further problems - I agree with your comment about having unit tests - I have added these two tests to my informal tests.

Note that this code still will not work reliably when doing the space replacement where there are ranges like A-Z, as noted yesterday. I am thinking of doing another version to get around that, although I think I will need to use StringTheory, as coding anything complex without it is (to me) a PITA and a waste of precious time.

RchdR · November 1, 2024, 2:18am


StrPosAndLen         PROCEDURE  (string pText,string pRegex,*long pMinLen,*long pMaxLen)

! (c) 2024 Geoff Robinson Vitessegr at gmail dot com  
! 29 October 2024
! released under the MIT License https://opensource.org/license/mit
!
! 1 November 2024 - move check to get start position to be AFTER space substitution
!                 - note: this code does not cope with ranges when replacing spaces!! 

x     long,auto
max   long,auto
b     byte
c     string(1),over(b) ! single char over b
stPos long              ! start position (return value)
dollarEnded string(size(pRegex)+1)                          
regex &string 

  CODE
  pMinLen = 0; pMaxLen = 0            ! initialise/clear
  if size(pText) = 0 or size(pRegex) = 0 then return 0.

  if instring(' ',pText) or instring(' ',pRegex) ! includes space so need to replace spaces if possible
    loop b = 255 to 1 by -1
      if instring(c,pText)            then cycle.
      if instring(c,pRegex)           then cycle.
      if instring(c,'^$.[]|{{}*+?\-') then cycle.   ! avoid special regex chars
      break
    end
  end

  if b   ! if we have a replacement char for space then do replacements 
    loop x = 1 to size(pText)
      if pText[x] = ' ' then pText[x] = c.
    end
    loop x = 1 to size(pRegex)
      if pRegex[x] = ' ' then pRegex[x] = c.
    end
  end

  stPos = strPos(pText, pRegex) ! get start position
  if ~stPos then return 0.      ! no match
  if stPos = size(pText)        ! single char match
    pMinLen = 1
    pMaxLen = 1
    return stPos
  end

  if pRegex[size(pRegex)] = '$' and sub(pRegex,size(pRegex)-1,1) <> '\' 
    regex &= pRegex
  else
    dollarEnded = pRegex & '$'
    regex &= dollarEnded
  end                         
 
  max = size(pText) - stPos     ! max increment size
  loop x = 0 to max
    if strPos(pText[stPos : stPos+x],regex)
       pMaxLen = x + 1
       if pMinLen = 0 then pMinLen = pMaxLen.  
    elsif pMaxLen
       break
    end
  end
  return stPos  ! return starting position
!----
#Edit1 added '-' in regex chars to avoid "if instring(c,'^$.[]|{{}*+?\-') then cycle."
#Edit2 moved check to get start position to be AFTER space substitution
#Edit3 needed to move code to append $ to regex to be AFTER check start pos

I don't see what c is.
Is c supposed to be chr(b)?

I see the code below looping and doing ptext[ x ] = c and pregex[ x ] = c, but in the instrings with c, I just don't see what c would be.

Without giving away my code, that's the wrong approach. You have to analyse the regexes and work out what they are. It's complicated stuff.

And all this effort and round-robin stuff is just because the search string is clipped.

It's a make-work exercise, aka resource burn. It's also a way to phish coders' abilities, and a way to keep coders from using a very useful and productive tech that would give some military squeaky sphincters when looking at the whole Clarion package, to name just a few ways to look at this.

And then when we look at AI generating code, authorities including militaries have the perfect deniable backdoors into systems because the coder just cut and pasted the code. Even I've been guilty of that, but does the $billion AI market deter me from getting a better way to program out to the masses? No.

It's hard not to be cynical, if I’m honest.

And at that point I see you replying, so I'm logging off…

vitesse · November 1, 2024, 2:38am

Hi Richard

See the definition earlier in the code:

b     byte
c     string(1),over(b) ! single char over b

It is my common way of doing these things when parsing.

So ‘b’ stands for byte and ‘c’ stands for char (i.e. string(1)), but they are at the same address, so yes, c is effectively chr(b) but without the overhead. Hope that is clear.

Well, everyone has different ways of doing things, but “different” doesn’t always equate with “wrong”.

Yes, I agree. But remember it was your idea to substitute a different character for space - and I think it was a good idea, although it turns out to be a little more complicated than (I at least) initially expected due to ranges etc.

I think it is really just a strpos() bug and there is nothing sinister lurking there.

cheers

vitesse · November 1, 2024, 3:19am

I realise we have, as is often the case, gone off topic from the original question about newline.

FWIW the docs in C11 are more or less the same as what Paul has shown from C6, although formatted differently.

To answer the original question, I tested using the '.P' and 'U.A' patterns shown in the docs:

x      long
 code

  loop x = 0 to 255
    if strPos(chr(x) & 'P','.P') <> 1
      stop('strPos did not match .P on ' & x)
    end
    if strPos('U' & chr(x) & 'A','U.A') <> 1
      stop('strPos did not match U.A on ' & x)
    end
  end

What it shows is that the docs are wrong.

The ONLY char that does not match is chr(0).

I suspect this is because somewhere internally they are using a cstring, so the null char is interpreted as “end of string”. So ignore the bit about “newline”, but remember to be wary of embedded nulls.
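The general cstring behaviour is easy to demonstrate outside Clarion. This Python sketch uses the standard ctypes module to copy bytes into a NUL-terminated C buffer and read them back the way C string code would. It says nothing about what strpos really does internally (that remains a guess); it only shows why an embedded chr(0) would cut the text short if a cstring is involved. The helper name as_c_string is made up for the example.

```python
import ctypes

def as_c_string(data: bytes) -> bytes:
    """Copy bytes into a C char buffer and read them back as a NUL-terminated string."""
    buf = ctypes.create_string_buffer(data)   # copies data and appends a terminating NUL
    return buf.value                           # stops at the first NUL, like strlen()

print(as_c_string(b"U\x0aA"))   # linefeed in the middle: all three bytes survive
print(as_c_string(b"U\x0dA"))   # carriage return: all three bytes survive
print(as_c_string(b"U\x00A"))   # embedded NUL: only b"U" is left
print(as_c_string(b"\x00P"))    # leading NUL: nothing is left
```

With the text cut off at the null, neither 'U.A' nor '.P' has enough characters left to match, which would fit the test result for chr(0).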

RchdR · November 1, 2024, 4:55am

Yeah, I see what the code is doing now: looping to find a char that's not used, then replacing space with that unused char.

The problem is that, although it is very rare in practice, all chars could be used. Then analysing the regex and modifying it becomes a must, otherwise it fails, and that's what I'm doing.

I already have a queue that tells me how many times each ascii code has been used in the search string and regex, because you don't want to swap space for an ascii code that is being searched for in the regex. That also means calculating whether the ascii code is used in a regex range. That's why your code still wouldn't work: the ranges could catch it out.
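As a rough illustration of that idea (not anyone's actual code from this thread), here is a Python sketch that counts how often each character is used and also rejects candidates that fall inside a bracket range such as A-Z. The names bracket_ranges and pick_placeholder are hypothetical, the range parser is deliberately simplistic (it ignores escapes inside sets and a leading ] in a set), and it assumes single-byte character codes 1 to 255.

```python
from collections import Counter

SPECIAL = set("^$.[]|{}*+?\\-")   # regex metacharacters to avoid

def bracket_ranges(regex):
    """Collect (low, high) code pairs for ranges such as A-Z found inside [...] sets."""
    ranges, i, in_set = [], 0, False
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":                     # skip an escaped character
            i += 2
            continue
        if not in_set:
            in_set = ch == "["
        elif ch == "]":
            in_set = False
        elif i + 2 < len(regex) and regex[i + 1] == "-" and regex[i + 2] != "]":
            ranges.append((ord(ch), ord(regex[i + 2])))
            i += 3
            continue
        i += 1
    return ranges

def pick_placeholder(text, regex):
    """Return a char that is safe to swap in for space, or None if every code is taken."""
    used = Counter(text) + Counter(regex)  # usage count per character
    ranges = bracket_ranges(regex)
    for code in range(255, 0, -1):
        ch = chr(code)
        if used[ch] or ch in SPECIAL:
            continue
        if any(lo <= code <= hi for lo, hi in ranges):
            continue                       # a range would match the placeholder
        return ch
    return None

print(bracket_ranges("[A-Z]+ [0-9]"))
print(repr(pick_placeholder("abc  ", ".  $")))
```

When pick_placeholder returns None there is no free character at all, which is the rare case where the regex itself has to be analysed and rewritten instead.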

I think I have devised a reasonable methodology for working out these regexes. I used it to do the picture validation in the template builder, to make sure a valid picture token is keyed in, and it's working for the regex analysis. The methodology makes doing regexes so simple.

RchdR · November 1, 2024, 4:59am

<10>, <13> and <13,10> don't affect strpos. In fact no ascii code affects strpos except <32>, which is automatically clipped from the end of the search string.

I haven't decided if I'm leaving that code in as a series of unit tests, just in case strpos works differently on other people's computers.

RchdR · November 1, 2024, 5:48am

It might be a genuine wtf moment, and whilst I wouldn't use the word ‘sinister’, because law-abiding people never are, it's exploitative, manipulative and stealing people's lives!