StrPos (RegEx) gives you the start character - how do you get the last character?

anon23294430 · June 20, 2022, 9:12pm

StrPos (RegEx) gives you the start character of a match. How do you get the last character of the pattern match?

To me it seems half completed.

TIA

Edit.

I should clarify: I don’t mean the end$ regex pattern match, which is how search engines interpret the question.

vitesse · June 21, 2022, 12:02am

Hi Richard

if you have StringTheory have a look at FindMatchPosition which returns the start and end position of the match.

If you don’t have ST, you can do it with strPos alone. Given the start character, call strpos() in a loop on a slice of the string that begins at startPos, starting with endPos equal to startPos and incrementing endPos by one until you get a match (i.e. until strpos returns a non-zero value). That will be your shortest match.
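
A minimal sketch of that loop, assuming both parameters are strings passed by value. ShortestMatchEnd is a hypothetical name used only for illustration; it is not Michael’s code or a StringTheory method, and it has not been tested against every regex:

```clarion
! Hypothetical helper: prototype would be ShortestMatchEnd(STRING,STRING),LONG
ShortestMatchEnd  PROCEDURE(STRING pText, STRING pRegex)
startPos  LONG,AUTO
endPos    LONG,AUTO
  CODE
  IF SIZE(pText) = 0 OR SIZE(pRegex) = 0 THEN RETURN 0.
  startPos = STRPOS(pText, pRegex)              ! where the match begins
  IF ~startPos THEN RETURN 0.                   ! no match at all
  LOOP endPos = startPos TO SIZE(pText)         ! grow the slice one char at a time
    IF STRPOS(pText[startPos : endPos], pRegex) ! first slice that matches
      RETURN endPos                             ! end position of the shortest match
    END
  END
  RETURN 0                                      ! no slice matched on its own
```

Because the loop stops at the first slice that matches, a pattern ending in + will report the shortest match, not the longest.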

actually here is an implementation by Michael Ware from 2004:

https://www.icetips.com/showarticle.php?articleid=418

well Michael’s implementation returns the length not the endPos but similar idea.

anon23294430 · June 21, 2022, 9:25am

I don’t have any add-ons, which is why I’ve been writing my own templates. I’m already using strpos, and the article wouldn’t work because I’m using it for api/procedure/template command calls, which means having to handle nested calls like
clip(left(loc:var))
or nested api calls.

I think I’ll just crack on with trying to figure something out, as it’s a little more involved than the usual strpos/match/regex examples seen online.

vitesse · June 21, 2022, 12:52pm

Hi Richard

sorry I don’t understand what you mean by not being able to use the referenced article.

if you can use strpos then you can just as easily use a function similar to the one shown that takes your regular expression and your text and returns the start and end position rather than just the start position returned by strpos.

the only difference was Michael’s code returned the string length rather than end position, but end position = (startPos + length - 1) so you can easily translate it.
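
As a worked example of that translation, take the text 'what#ever123' and the regex '#[A-Za-z]+': the match starts at position 5 and is 5 characters long (#ever). The variable names below are illustrative only:

```clarion
  PROGRAM
  MAP
  END
StartPos  LONG
MatchLen  LONG
EndPos    LONG
  CODE
  StartPos = STRPOS('what#ever123', '#[A-Za-z]+')  ! 5, the position of '#'
  MatchLen = 5                                     ! length of '#ever', as a length-returning helper would report
  EndPos   = StartPos + MatchLen - 1               ! 5 + 5 - 1 = 9, the position of 'r'
```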

I think I must not be understanding what you mean so maybe explain what difficulties you are having and I will see if I can help more.

cheers

Geoff R

anon23294430 · June 21, 2022, 5:41pm

That breaks on the 2nd character of the pattern match strings I’m using here; otherwise it might have worked.

anon23294430 · June 21, 2022, 11:02pm

regex pattern = '#[a-zA-Z]+'
Use it on any line of template code and that example will fail, but I have got some different code working which returns the last character, which, as you point out, is just the length.

Edit.
This is the solution.
Pass the string.
passedstring = Clip(Left(passedstring))
Add ^ to the front of the regex pattern string.
Add $ to the end of the regex pattern string.
Loop
    Add one character at a time.
    If a match
        If the last char is not a space
            add 1 to length.
Endloop
Return.

Edit2

This is an improvement because you don’t have to do the clip(left(passedstring)) or add ^ to the front of the regex.

So now if a strpos is found, pass the Cstring, String or Pstring and the regex (again a cstring, string or pstring) to this function and it will return the length.

It still uses the RegEx “$” tacked onto the end of the RegEx pattern to return 0 when it starts to encounter characters not part of the RegEx pattern match.
Because spaces <32> get included in the RegEx pattern match, there is a separate IF statement to SUB the last char and reject it if it’s a space.

If however a space is needed, then remove that IF NOT Sub bit.

Written so it should be easy to see what is going on in the debugger.

StrPosLen  Procedure(? pLine,? pRegEx)

!? take both string and cstring.

Loc:Line            &Cstring
Loc:LineLen         Long
Loc:RegEx           &Cstring
Loc:StartPos        Long
Loc:LoopCnt         Long
Loc:SubString       &Cstring
Loc:SubStringPos    Long
Loc:Len             Long



    Code
    Loc:LineLen         = Len(pLine) + 1
    Loc:Line            &= New(Cstring(Loc:LineLen)) !Make it the same size as parameter pLine
    Loc:Line            = pLine
    Loc:RegEx           &= New Cstring( Len(pRegEx) + 2)
    Loc:RegEx           = pRegEx
    Loc:SubString       &= New Cstring (Loc:LineLen)

    Loc:StartPos        = StrPos(Loc:Line,Loc:RegEx,1) !NOT Case Sensitive
    Loc:RegEx           = Loc:RegEx & '$'

    Loop Loc:LoopCnt = 1 to Loc:LineLen
        Loc:SubString       = Sub(Loc:Line,Loc:StartPos, Loc:LoopCnt)
        Loc:SubStringPos    = StrPos(Loc:SubString,Loc:RegEx)
        If Loc:SubStringPos
            If NOT Sub(Loc:SubString,Loc:LoopCnt,1) = ' ' !Space <32>
                Loc:Len = Loc:LoopCnt - (Loc:SubStringPos - 1)
            End
        Else
            IF Loc:Len !Aint gonna get any bigger now
                Assert(Loc:LoopCnt < Loc:LineLen)
                Break
            End
        End
    End

    Return Loc:Len

vitesse · June 22, 2022, 11:11am

OK I think I can see where the confusion is. That code of Michael’s aims to give you the SHORTEST match.

so if your regex was ‘#[a-zA-Z]+’ and your line of code was ‘#ATSTART’ that would return a length of 2 as it matches with ‘#A’ but you want it to return the LARGEST match which in this case is the full length of 8.

is that correct?

vitesse · June 22, 2022, 11:50am

Hello again Richard

a couple of comments re your code.

wherever you have a NEW you should have a corresponding DISPOSE before returning to avoid a memory leak.

I feel at least some of those NEWs are probably not needed anyway and probably slowing things down. You should be able to work on the passed strings directly using string slicing. I would tend to just pass in strings (and if speed is important try to do so by reference rather than value).

Re your line:

Loc:RegEx = Loc:RegEx & '$'

you should probably only append a $ where the last character in the regex is not $

try to avoid unnecessary sub() For example

If NOT Sub(Loc:SubString,Loc:LoopCnt,1) = ' ' !Space <32>

could be more simply and efficiently stated as

if Loc:SubString[Loc:LoopCnt] <> ' ' ! space <32>

provided you are sure that Loc:LoopCnt is within the bounds of your string

or even more simply

if Loc:SubString[Loc:LoopCnt]

but to be honest I don’t think I understood what you meant about needing to check for the space - maybe you can explain more, thanks.

anyway just some thoughts and hope that helps

Geoff R

vitesse · June 22, 2022, 1:14pm

Hello again Richard

I just did a version with some of the suggestions

I deliberately did not use StringTheory as you don’t have it and I must admit it did feel a little like coding with one hand tied behind my back

cheers again

Geoff R

StrPosLen            PROCEDURE  (string pText,string pRegex)
x        long,auto
max      long,auto
len      long
stPos    long
regex    &string
  CODE
  !if ~address(pText) then return 0. ! uncomment this line if you decide to pass pText by reference instead of value
  if size(ptext) = 0 or size(pRegex) = 0 then return 0.
  stPos = strPos(pText, pRegex) ! get start position
  if ~stPos then return 0. ! no match
  if stPos = size(pText) then return 1. ! match on last char

! the following would be easier using StringTheory:
!  st.setvalue(pRegex)
!  if not st.endsWith('$') then st.append('$').
                                   
  if pRegex[size(pRegex)] = '$'
    regex &= pRegex  ! point at passed regex
  else
    regex &= new String(size(pRegex)+1)
    regex = pRegex & '$'
  end

  max = size(pText) - stPos ! max increment size
  loop x = 1 to max
    if strPos(pText[stPos : stPos+x],regex) 
       len = x + 1
    elsif len
       break
    end
  end
  if address(regex) <> address(pRegex) then dispose(regex).
  return len

anon23294430 · June 22, 2022, 1:16pm

Yeah I know, it’s one of the reasons why I’m writing this app: to pick up mistakes, change code, etc.

I can do that for the string I’m searching for a pattern, because I don’t need to change it, but I can’t do that for the regex that is passed: I can’t add anything to the passed regex because its size is fixed, which is why I do
Loc:RegEx = pRegEx
Loc:RegEx = Loc:RegEx & '$'

Yep, I know about that. It’s one of those “hmm, I could improve that” things I spotted after I’d posted the code here.
If NOT Sub(Loc:RegEx, Len(Loc:RegEx), 1) = '$'
    Loc:RegEx = Loc:RegEx & '$'
End

I like to see what’s going on in the debugger or debugview. Not only that, in this instance I haven’t gone through all the characters that could also match in strpos, if any more exist, so there might be a need for other character checks besides the space character.

I’m not a fan of string slicing because I like to see what’s going on in the debugger or debugview.

So the regex '#[a-zA-Z]+' detects the word portion of template code in a source file, excluding the #!, #$, #<, #? functions.
So if I have a line of template code
#elsif(%somefunction)
the regex above includes the (%somefunction) part in the length, so I need to add the $ to the regex, giving '#[a-zA-Z]+$', to get just the #elsif portion. Because it’s building the line one char at a time in the loop, once it hits the ( it aborts.

If the line of template code is like below with x number of trailing spaces
#else
the regex '#[a-zA-Z]+$' matches the trailing spaces <32> for some reason, so I need the line of code below to detect the spaces and ignore them.
If NOT Sub(Loc:SubString,Loc:LoopCnt,1) = ' ' !Space <32>

I could put a break in there, but I sometimes need the space in a character set, so this example isn’t finished for me. I still need to go through the various regexes I have.

I’ve started using regexes in more code now. For example, I have various regexes that make it possible to copy and paste Windows APIs and data structures off the MS website and then convert them into a Clarion format ready to paste into an app. A few mouse clicks and the work’s all done. Very quick!

anon23294430 · June 22, 2022, 1:25pm

I’ll check it out, because this regex is just a tiny part of a bigger problem I’ve been working on for over 8 months, which has involved writing apps, special templates and more…

For example, I still need to process the parameters, and they have additional requirements as you can see here.

anon23294430 · June 22, 2022, 2:47pm

It should return 8 (note1) because the + after the character set [a-zA-Z] means one or more, so
#A
#AT
#ATS
#ATST
#ATSTA
#ATSTAR
#ATSTART
are all valid matches for that character string
However the Loop Until makes it return the shortest match.

note1 This ignores the problem where the regex includes other trailing characters until the $ is added to the end of the regex (if it’s not present).

When $ is present at the end of the regex, strpos and match still include spaces, aka <32>, which is not valid, and the space character is not included in the regex anywhere either. I do use spaces inside character sets and externally; here is an example where the space is between the two + signs.
Loc:PatternMatch = '[A-Za-z0-9_]+ +[A-Za-z0-9]+\[{{[0-9A-Z_]+}\];'

Here I’m using space after the first parenthesis and after the comma inside the parenthesis.
Loc:PatternMatch = '^\} [A-Z_]+{{, +[*A-Z_]+}?;'

Is it a bug? I would probably class it as a bug, but others might not. Either way, I’ve worked around it for now.

On this bit of code that you posted

  if ~address(pText) then return 0.
  stPos = strPos(pText, pRegex) ! get start position
  if ~stPos then return 0. ! no match
  if stPos = size(pText) then return 1. ! match on last char

I don’t need this because it’s part of a bigger section of code, i.e. I’m already calling strpos to establish a match, so I don’t need to test again inside the function.

However, if I was writing an add-on like StringTheory, I would probably want to include some checks on the passed parameters, as it’s being called in isolation. But then I’d want to add some feedback to let the programmer know why it returned zero, because the code that returns 0 doesn’t tell me why it failed, which is why I use debugview even in templates.

This is the thing with code, “there are so many ways to skin a cat”.

vitesse · June 23, 2022, 3:39am

Yes this is very true!

I did some tests and you are quite right about the spaces on the end. I am not sure if that is deliberate or is a bug in strPos.

  ans1 = strPosLen('#ENDAT','#[A-Za-z]+')
  if ans1 <> 6 then stop('strPosLen test 1 failed : expected 6 but got ' & ans1).
  
  ans1 = strPosLen('#ENDAT','#[A-Za-z]')
  if ans1 <> 2 then stop('strPosLen test 2 failed : expected 2 but got ' & ans1).
  
  ans1 = strPosLen('what#ever123','#[A-Za-z]+')
  if ans1 <> 5 then stop('strPosLen test 3 failed : expected 5 but got ' & ans1).
                
  ans1 = strPosLen('#ENDAT      ','#[A-Za-z]+')
  if ans1 <> 6 then stop('strPosLen test 4 failed : expected 6 but got ' & ans1).

doing these tests my earlier version of strPosLen fails on test 4 which you can see has six spaces after #ENDAT. This test returned 12 rather than 6 as the trailing spaces were counted.

I cannot readily see a universal way to avoid this, as sometimes you might have regexes where you want to include trailing spaces.

So for now I have added an extra optional parameter which defaults to stripping off trailing spaces. If you have a regex where you want the trailing spaces counted then you would add a third parameter of false:

  ans1 = strPosLen('#ENDAT      ','#[A-Za-z]+',false)
  if ans1 <> 12 then stop('strPosLen test 5 failed : expected 12 but got ' & ans1).

The following code passes these five tests:

prototype :
StrPosLen Procedure(string pText, string pRegex, bool pExcludeTrailingSpaces=true),LONG !return the maximum matching string length

code:

StrPosLen  PROCEDURE (string pText,string pRegex,bool pExcludeTrailingSpaces) 
x        long,auto
max      long,auto
len      long
stPos    long
regex    &string

  CODE
  !if ~address(pText) then return 0. ! uncomment this line if you decide to pass pText by reference instead of value
  if size(ptext) = 0 or size(pRegex) = 0 then return 0.
  stPos = strPos(pText, pRegex) ! get start position
  if ~stPos then return 0. ! no match
  if stPos = size(pText) then return 1. ! match on last char
                                   
  if pRegex[size(pRegex)] = '$'
    regex &= pRegex  ! point at passed regex
  else
    regex &= new String(size(pRegex)+1)
    regex = pRegex & '$'
  end

  max = size(pText) - stPos ! max increment size
  loop x = 0 to max
    if strPos(pText[stPos : stPos+x],regex) 
       len = x + 1
    elsif len
       break
    end
  end
  if address(regex) <> address(pRegex) then dispose(regex).
  if len and pExcludeTrailingSpaces
    len = len(clip(pText[stPos : stPos+len-1]))
  end
  return len

hth

Geoff R

#Edit1 (30th October 2024) change “loop x = 1 to max” to “loop x = 0 to max”
#Edit2 (7th November 2024) I’ve just posted a much better solution to this problem (fixing issues with trailing spaces) at:

Bruce · June 23, 2022, 4:28am

For what it’s worth this can simply be declared as

Loc:RegEx Cstring(size(pRegEx)+2)

That way you don’t have to do the NEW, and you don’t have to remember to do the Dispose.
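
A minimal sketch of that declaration in context, assuming the regex is passed as a STRING (by value) so that SIZE() can be used on it. AnchorRegEx is a hypothetical helper name used only for illustration:

```clarion
! Hypothetical helper: prototype would be AnchorRegEx(STRING),STRING
AnchorRegEx   PROCEDURE(STRING pRegEx)
Loc:RegEx       CSTRING(SIZE(pRegEx) + 2)  ! +1 for '$', +1 for the CSTRING terminator
  CODE
  IF SIZE(pRegEx) = 0 THEN RETURN '$'.     ! nothing passed, return just the anchor
  IF pRegEx[SIZE(pRegEx)] = '$'
    Loc:RegEx = pRegEx                     ! already anchored, copy as-is
  ELSE
    Loc:RegEx = pRegEx & '$'               ! append the end-of-string anchor
  END
  RETURN Loc:RegEx                         ! local storage is released on return
```

The local CSTRING is sized when the procedure is entered and freed when it returns, so there is nothing to DISPOSE.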

anon23294430 · June 23, 2022, 4:40am

I’ll give this a go, because it’s not documented in the C6 or C11 help pages on data type declarations like cstrings or strings.

The Size help page exists and, having just looked at it, I can see there are two examples. Bit of a sleeper function imo.

I had tried
Loc:RegEx Cstring( Len(pRegEx) + 2)
and whilst it compiles, in the debugger it just throws an access violation.

anon23294430 · June 23, 2022, 5:25am

I don’t know, I’ve done so much on regexes my brain is absolutely fried. There are so many other implementations in other languages that it’s like a treasure hunt, and I hate treasure hunts.

But either way, to answer my original question, it seems there has to be this two-step process to get the length.
Step 1: the usual pattern match without a $ terminator.
Step 2: add the $ terminator and build the string one char at a time from the position returned by StrPos in step 1.

vitesse · June 23, 2022, 6:59am

ha ha that reminded me of a famous quote from 1997 by Jamie Zawinski:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

anon23294430 · June 23, 2022, 7:52am

Well, when I make that point, one example is that some languages have a directive which allows the regex to impose a length constraint, e.g. the string can only be 255 chars long.

Yes, it could be done in Clarion by using [a-zA-Z] 255 times instead of one [a-zA-Z]+, but the directives in other languages make life easier and the regexes smaller in size.

I sort of agree with some of the article’s points, but I also disagree with others; this might be because of regional differences in concepts found in different parts of the world.

In the past, I have ditched using regex for testing if an email is valid or not, simply by using nettalk to send a test email to the email address.

Once the email server confirms or denies the email address is valid, I never proceeded with sending the email body, thus aborting the process. Now that is exploiting the RFC method of communication for email, just like denial of service on webservers exploits the tcp communication by pausing some of the packets, which can also be used to tie up resources in firewalls and then make services crash in a firewall that then makes it possible to deploy some nasty payload.

Now that email method also happened to be good for tracking the continued existence of email addresses as a way to track the movement of people in an industry over time as they moved from one job to another.

That was until the EU came out with some clarification on email addresses in the old Data Protection rules/law which made it amount to illegal spying on people.

GDPR issues - Do work emails count as personal data? - Cognitive Law

This was something I wrote for a company back in the early 00’s, who had quite a big database of email addresses of people, mainly based in the UK, but also people from outside the UK.

MS Exchange/Outlook.com now makes it harder to track those people or even to look up email addresses. For example, I could scan a webpage like a spider or with wget, parse names out of it, and then test different formats of email addresses with their email server in order to get someone’s email address. In years gone by, even simply getting someone’s full name and their employer’s domain name was enough to figure out their email address and then spam them.

That’s why even here, Find MPs - MPs and Lords - UK Parliament, they use a variety of different email address formats to reduce their spam. However, when communicating with a politician, because they ask for your details (as they can only work with their constituents), they are asking for personal data by virtue of the political process. Thus every MP in effect becomes a data controller, and I don’t think they are exempt from this role because they represent so many people.

So far my Foreign Secretary MP hasnt acknowledged my GDPR request that I submitted last year and could become a little political nuisance!

I got Andy’s personal email address using telnet because he doesn’t list his personal work email address on his website (Contact Us – noyantis). But like I say, MS Exchange/Outlook.com is getting hot on that method as well, so it doesn’t always work. It also shows the level of data MS have on people around the world in order to “combat spam”.

Even MS, Facebook & Google haven’t complied with my GDPR requests.

Anyway, that’s off the point and Carl will be along pretending to be Russ.

anon23294430 · June 23, 2022, 9:36am

So I just tried this: Size() doesn’t work with a ? (any) parameter; it only works with defined data types like string or cstring, passed by value or by address.

So using NEW is the only option if I want to continue to use the ? (any) data type, which I’m happy with because it can handle more datatypes (cstring, string, pstring, astring) and might be ready for Clarion Ustrings out of the box without any code changes.

So, considering the slower speed of the heap, whilst also remembering DDR5 is now the new standard, I think the drop in speed from using the heap instead of the stack is not worth worrying about on today’s hardware in this situation.

However I might change my mind when I’m downloading github and porting it all to Clarion.

vitesse · June 23, 2022, 11:30pm

I don’t have time to test right now but if you just use STRING passed by value, won’t Clarion’s automatic type conversion do its magic if you pass cstring, pstring, astring etc?