StringTheory - Which methods accept RegEx notation, and are there any RegEx methods per se?

RchdR · October 18, 2024, 10:23am

Following on from @vitesse's post, which showed KeepCharRanges using a-z as a regex-style range, I wondered what other regex functionality exists.

Looking through the docs (StringTheory Complete Documentation) I see a few, such as:

https://www.capesoft.com/docs/stringtheory/StringTheory.htm#Match

https://www.capesoft.com/docs/stringtheory/StringTheory.htm#FindMatch

But I wondered if the methods could be grouped so they stand out as ‘these are the regex/strpos methods & properties’?

The reason I ask is that I'm currently merging a couple of regex functions into one, and if the functionality it provides already exists in StringTheory, there's no point in reinventing the egg.

My function

StrPosProcess (**? pSearchString, ? pRegEx, Long pMode=0, Long pStrPosBitmaskOptions=0, Long pInstance, <? pReturnValue>), Long !Long = Returned Errorcode

Pertinent bitmask equates (not the full list):
!Just started adding this yesterday.
RegExLeftToRightProcessing
RegExRightToLeftProcessing

!alters the passed regex pattern match string
RegExMatchNoLeadingSpaces
RegExMatchNoTrailingSpaces
RegExMatchNoLeadingTrailingSpaces

!
ReturnLen !len of the regex match
ReturnMatch !returns the matched string
ReturnLeading !returns leading string before match start
ReturnTrailing !returns trailing string after match end

! Works with above returnlen and returnmatch
ReturnMinSize !returns the smallest matched string size found
ReturnMaxSize !returns the largest matched string size found

There are other bitmask options, and the function also lets you specify pInstance, so you can ask for the 1st or Nth match in a string.
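To make the idea concrete, here is a minimal sketch of an "Nth match" lookup that returns the match, its length, and the leading and trailing text. It is written in Python with the standard `re` module, not Clarion, and is not the StrPosProcess code itself; the function name `nth_match` and the dictionary keys are hypothetical and only loosely mirror the ReturnMatch/ReturnLen/ReturnLeading/ReturnTrailing equates above.

```python
import re

def nth_match(text, pattern, instance=1):
    """Return details of the Nth (1-based) match of pattern in text, or None."""
    # finditer walks the text left to right, yielding non-overlapping matches.
    for count, m in enumerate(re.finditer(pattern, text), start=1):
        if count == instance:
            return {
                "match": m.group(0),           # the matched string
                "len": m.end() - m.start(),    # length of the match
                "leading": text[:m.start()],   # everything before the match
                "trailing": text[m.end():],    # everything after the match
            }
    return None  # fewer than `instance` matches were found

second = nth_match("a=1, b=22, c=333", r"[0-9]+", instance=2)
print(second)
```

Asking for an instance that does not exist returns `None` rather than an error code, which is a design choice of this sketch only.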

It works well in the StrPosLen/StrPosMatch forms, but there's so much code overlap that I'm merging these two functions back into one.
But if StringTheory does this for me, I might just get that, hence the question.

I'm using these function(s) to parse TXAs and source files from other programming languages, and it generally works.

The hardest part is getting the regexes right. When they work it's so easy to validate source code, but returning the matched string is key to making progress with regex pattern matches.

vitesse · October 18, 2024, 12:11pm

Hello again Richard

ST has a number of regex type methods and you have already identified some of them.

I won’t bother saying what these do, as that information should be readily available in the docs, but FWIW here is a list:

Match
FindMatch
FindMatchPosition
SplitByMatch

I agree that it would be a good idea if Bruce were to list these (and any others I may have forgotten) under the Regular Expressions section of the docs.

As mentioned earlier (Need help with some text functions - #40 by vitesse), I don’t consider KeepCharRanges and RemoveCharRanges to be regex even though they have the ability to deal with character ranges such as:

‘A-Za-z0-9’
‘<0>-<31>’ and so on.

I think your StrPosProcess has different functionality compared to what ST offers, but it would most likely be easier to write your function using ST compared to straight Clarion as it allows you to think at a higher level and not worry about the low-level details where “off by one” errors often occur. And the existing ST regex methods would likely get you some of the way there.

For example dealing with leading or trailing spaces is trivial, and the start and end position are available using FindMatchPosition so from those again it is a simple matter to get the string before and after. Mind you, FindMatch and FindMatchPosition return the minimum matched size, so you would have to do further checks if you wanted to get the maximum match - basically just try one more char then another in a loop until the regex no longer matches.
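The "try one more char, then another" loop can be sketched as follows. This is Python's `re` module standing in for the ST methods, purely to illustrate the idea; `shortest_and_longest` is a hypothetical helper, and the start position is assumed to be already known (for example from a position-returning call).

```python
import re

def shortest_and_longest(text, pattern, start):
    """From `start`, find the shortest substring that fully matches `pattern`,
    then extend it one character at a time until it no longer matches."""
    rx = re.compile(pattern)
    end = None
    # Find the minimum match: the first end position giving a full match.
    for stop in range(start + 1, len(text) + 1):
        if rx.fullmatch(text, start, stop):
            end = stop
            break
    if end is None:
        return None
    shortest = text[start:end]
    # Try one more char, then another, until the regex stops matching.
    while end < len(text) and rx.fullmatch(text, start, end + 1):
        end += 1
    # Longest match plus the strings before and after it.
    return shortest, text[start:end], text[:start], text[end:]

print(shortest_and_longest("ver 1234 ok", r"[0-9]+", 4))
```

One caveat with stopping at the first failure: a pattern such as `a(bb)*` matches "a" and "abb" but not "ab", so this simple loop would stop early on that kind of pattern.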

Aside: you may recall we did discuss this back in June 2022, see:

ST has a similar idea in the method FindNth and its various related methods like AfterNth, SetAfterNth, BeforeNth, SetBeforeNth, ReplaceNth, but these are for simple strings not regex.

I have done a fair bit of that kind of work in the past and find ST invaluable, especially the various Split options.

cheers for now

Geoff R

RchdR · October 18, 2024, 2:11pm

I can't find these in the online docs. Am I looking in the right place?

https://www.capesoft.com/docs/stringtheory/StringTheory.htm#StringTheoryMethods

June 2022, has it been that long?

The off-by-one errors aren't really an issue; it's when it decides to include a space at the front or back that's the issue. And I don't know if it's a bug, a hack or something else. It's random when it happens, and annoying. I use cstrings for the bulk of the work instead of strings, but when it plays up it's as if the cstring is behaving like a string. This is in C6 on XP and C11 on Windows 10 and 11. See the edit below.

For example, I've got regexes going that validate every Clarion picture combo (string, numeric, date, time, pattern and key-in pictures etc.), but that's a series of regexes using the alternation | symbol, which saves me having to write Clarion IF/THEN or CASE statements.

So I'm not having to write as much Clarion code now.

ST would handle importing encoded files, but that's not an issue with PowerShell now because I've added " -encoding ascii " to the out-file commands.

I'm not doing any web work at the moment, so I don't have any need to move data quickly between UTF-8 and whatever, but I've already rolled my own functions that do a lot of what ST does.
I've even got my own Unicode functions, just not put any of it into a class yet for added convenience.

I was just wondering if ST had some purpose-built regex functionality, but apart from range checking it doesn't look like it.

Anyway, thanks for letting us know.

Edit.
So still working on this regex string problem.

So the regex pattern is in a cstring just long enough for the pattern; it's not in a string with trailing spaces in the left-to-right processing.

In the debugger I see the regex pattern in the cstring, but when the regex plays up it behaves like a pattern stored in a string. It's as if I’m looking at a cstring OVER’ed a string in the debugger and strpos is using the string.

I can't see the string that strpos is using, not even in the debugger's assembler view; that's hidden away even from the debugger.

This is precisely what Windows Address Space Layout Randomisation (ASLR) is supposed to prevent, but thinking about it I haven't checked whether ASLR is on in 10 and 11, and it certainly won't be in VMware XP running C6.

But that's how the regex pattern behaves when it plays up: as if it's in a string and not a cstring.

vitesse · October 19, 2024, 4:37am

that’s the docs for an earlier version. You want the ST3 docs:

https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#StringTheoryMethods

can you please provide an example of a regex and data where it is not working so I can check it out. I must admit I don’t use cstrings myself except when I have to interface with some third party API that uses them, but I realise many people like them despite their inefficiencies

RchdR · October 19, 2024, 11:15am

Google was directing me to ST2. I didn't even know there's an ST3 version now, and the webpage text is small on a mobile.

Powershell -command "$psversiontable.psversion | out-file -encoding ascii -filepath 'c:\some folder\some filename.txt'"

That creates an ascii file with the powershell version info.
Line1 is blank
Line2 Major Minor Build Revision
Line3 ----- etc
Line4 2 0 -1 -1

Using the ASCII driver, I read the lines, copy each into a cstring and then regex them.

First regex for line 3 is
{{{{ }+}?{{.}+{{ }+{{.}+{{ }+{{.}+{{ }+{{.}+

Which is optional one or more leading spaces.
One or more any char except crlf
One or more spaces
One or more any char…
One or more spaces
One or more any char…
One or more spaces
One or more any char.

The PowerShell line is basically a number then trailing spaces up to the column width for each of the four segments, with a space between each segment, so the Revision column is a number, then spaces, then CRLF.

That regex works fine; it is deliberately loose because I can't predict what future changes may be made to PowerShell, or whether letters will be (or have been) introduced into any of the versions. The minus signs in 2 0 -1 -1, for example, come from PowerShell 2, the starting version of PowerShell on XP.

Anyway, the above regex is just a catch-all for any chars, making sure they are in the 4-column layout, with an optional handler for any leading space before the Major column.

I then loop through the search string, adding 1 char at a time, with these regexes for each column:
Major ^{{[1-9][0-9]?}$
Minor ^{{[1-9][0-9]?[0-9]?[0-9]?}$

So Major can be 1 or 2 digits: the 1st char must not be zero and the 2nd char can be the full range. Yet it matches all the way up to the Minor column; it's including the trailing spaces after 2 or 10 (10 is PowerShell on Win11).

So I have to introduce a [^<32>] before the $.

Minor is 1 to 4 numbers where 1st number has to be not zero.
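For comparison, this is a sketch of how the same checks behave in Python's `re` module, using the layout regex with the Clarion curly-brace escaping removed (as described for Notepad++). It is not Clarion's STRPOS, the column padding in the sample line is an assumption, and `longest_anchored_prefix` is a hypothetical helper that mimics the "add 1 char at a time" loop.

```python
import re

# Layout check: optional leading spaces, then four runs of chars separated by spaces.
LAYOUT = re.compile(r"( +)?(.)+( )+(.)+( )+(.)+( )+(.)+")
# Major column: 1 or 2 digits, first digit not zero, anchored at both ends.
MAJOR = re.compile(r"^([1-9][0-9]?)$")

def longest_anchored_prefix(field, rx):
    """Grow the candidate one char at a time; keep the longest prefix that matches."""
    best = ""
    for stop in range(1, len(field) + 1):
        if rx.search(field[:stop]):
            best = field[:stop]
    return best

line = "2      0      -1     -1      "   # values from the thread, padding assumed
print(LAYOUT.match(line) is not None)            # layout check passes
print(repr(longest_anchored_prefix("10     ", MAJOR)))  # prefix without the padding
print(MAJOR.search("10     "))                   # padded value fails the anchored regex
```

Here the terminating `$` stops the match as soon as a trailing space is added, which is the behaviour being expected from the Major regex above.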

I can't put together an example app as I'm on mobile while trying to write an app to automate the lockdown process after installing Windows onto the laptop, before it goes back online. This also assumes there isn't malware or something I've copied over from backup while reinstalling my baseline of apps, like Notepad++, C6, C11, VMware, etc.

Edit.
If you want a hack to independently test the regexes, remove the curly braces and use the regex in Notepad++'s regular expression search with the txt or tps file loaded, or use the C6/C11 search facility, which uses the same regular expressions as Clarion apps, so there's no need to remove the curly brackets. This is how I know the regexes are correct and the problem is somewhere else in my app, but I can't find it!
Notepad++ highlights the entire match; C6 doesn't, and I can't remember about C11.

Edit2.
The other annoying thing is that I have this strpos match and strpos len code working perfectly in my template builder, using the template language to validate prompt inputs and other bits, only that's using strings, as that's all that's supported along with longs and reals.

vitesse · October 19, 2024, 12:38pm

Hi again Richard

I am sorry but while I can see what you are doing with the regex’s, I am no clearer as to what the actual problem is.

There is a saying along the lines that if all you have is a hammer then you treat everything as a nail. I am wondering if because you are so confident/proficient with regex’s that you look for opportunities to use them when perhaps they are not the best solution?

A related thought is Whorf’s Hypothesis which essentially says that the language you speak shapes your thoughts. Examples are often given about Eskimos having x words for “snow”.

OK so whereas you speak in regex, I speak in StringTheory and so our approaches to something like this will obviously be different. Chances are you would prefer your way and I mine. That’s fine and almost a given.

But if you will indulge me a little, I will show how I might approach this example using ST, if only to show an alternative approach.

st  StringTheory
lne StringTheory

  code
  if not st.loadFile('c:\some folder\some filename.txt') then <error handling>.
  st.split('<13,10>')                ! split into lines
  lne.setvalue(st.getLine(2))        ! get second line
  lne.trim()                         ! get rid of spaces at start and end of line
  loop while lne.replace('  ',' ').  ! multiple spaces to one space
  lne.split(' ')                     ! split into columns
  if lne.records() <> 4 then <error handling>. 
  major = lne.getLine(1)
  minor = lne.getLine(2)
  build = lne.getLine(3)
  revision = lne.getLine(4)

anyway I suspect you have such an investment in your regex code that you don’t want to alter course now, but it never hurts to see other approaches. And perhaps I have missed your point entirely!

I get that you are saying somewhere a cstring is being treated as a string, but I am not clear where and what the exact problem is.

anyway cheers and good night from Down Under.

RchdR · October 19, 2024, 4:18pm

So you're treating it as a CSV import, where the comma is a space and there's no validation of the data in each field.

lne.split(' ')

I get what you are saying if all you have is a hammer, everything is a nail, but the regexes give me a complex field level of validation.

This is only a simple example, but if you add up all the lines of code needed to validate, say, a Clarion picture using the template language, that's 7 lines: one regex each for date, key-in, numeric and currency, pattern, scientific notation, string and time.

How would ST validate all the clarion pictures in 7 lines of code or less?

Bottom line is use the best tool for the job. I can't blindly import data without validating it, and the regexes afford me a level of validation that's quick and requires little code, yet is quite flexible. What they lack I can code in Clarion to complement the regexes. I have an app here importing template files and Clarion source into TXA format. It's not finished and it needs to import source code from other languages, but doing this in just Clarion code was too much of a resource burn, so regexes are relied on heavily to help with the importing, and then you have to work with the TXA rules.

Re the problem, it seems to me to be one of two things.

Either the cstring regex is losing/ignoring the terminating $, or it's using a string and is still losing/ignoring the terminating $.

That's the crux of the problem, but I can see the terminating $ in the regex field in the debugger, so it's a mystery.

PS. Not doing a barbie tonight with some amber nectar whilst not giving a XXXX?

vitesse · October 20, 2024, 8:36am

well that is one perspective. I see it as using split to slice and dice data into small manageable chunks. And yes CSV processing is also doing the same kind of thing - breaking into lines and then each line into fields.

there are often numerous ways to do things. So in that code I did split initially on CRLF which allows easy indexed access to each separate line. But in the end in this case I only used the second line, so I could have actually just used one ST object and not done the initial split.

  st.loadFile(....)
  st.setBetween('<13,10>','<13,10>')

however there is a risk that there is only one CRLF in the file (2 rows) and in that case this would not work. So safer to say:

  st.loadFile(....)
  st.setAfter('<13,10>')
  st.setBefore('<13,10>')

that gets you the second row into your ST object, then as an alternative to trim() and replacing multiple spaces to one:

  st.split(' ')
  st.removeLines()  ! get rid of blank lines
  if st.records() <> 4 then <error handling>. 
  major = st.getLine(1)
  minor = st.getLine(2)
  build = st.getLine(3)
  revision = st.getLine(4)

and sure you can do validation on each field if you wish, but you make a good point about doing it all in your regex - so it seems like a good way for you to go.
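The two approaches can also be combined: split first, then validate each field. The sketch below shows that shape in Python rather than ST, as an illustration only. The function name `parse_version_line` and the `RULES` table are hypothetical; the Major rule is the one given earlier in the thread, while the rule used for the other three columns (an optional minus sign followed by digits) is an assumption based on the sample values.

```python
import re

NAMES = ("major", "minor", "build", "revision")

# Per-field rules. Major is from the thread; the others are assumed for illustration.
RULES = {
    "major": re.compile(r"[1-9][0-9]?"),
    "minor": re.compile(r"-?[0-9]+"),
    "build": re.compile(r"-?[0-9]+"),
    "revision": re.compile(r"-?[0-9]+"),
}

def parse_version_line(line):
    """Split a version line into four columns and validate each one."""
    fields = line.split()  # splits on runs of whitespace and drops empty pieces
    if len(fields) != len(NAMES):
        raise ValueError("expected 4 columns, got %d" % len(fields))
    result = dict(zip(NAMES, fields))
    for name, value in result.items():
        # fullmatch means the whole field must satisfy the rule.
        if not RULES[name].fullmatch(value):
            raise ValueError("bad %s value: %r" % (name, value))
    return result

print(parse_version_line("   2      0      -1     -1      "))
```

A layout change (a missing or extra column) or an unexpected value raises an error instead of being imported silently.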

re the error you are getting: next time you experience it can you post it here so I (and/or others) can check. Another pair of eyes and all that…

Ha funny you say that - I was actually using the bbq but using it as a pizza oven for a change, with some new pizza stones. Works great. And while I don’t drink XXXX, I did have some amber nectar as I brew my own ales (generally dark ales or stouts). Quite a good hobby!

cheers

RchdR · October 20, 2024, 9:22am

Yeah, you see I'm actually regex'ing the titles, then the dashes, then the values. And I can do that as 3 records or one line. It's to make sure nothing else changes in the output, but if it does, the program notes it and alerts that a change has been detected rather than failing. That's a simple test. Try:
Powershell -command "get-wmiobject win32_pnpentity | out-file -encoding ascii -filepath 'c:\some folder\some file.txt'" and then parse the resultant output using regexes to monitor changes in hardware.

The __PATH output is interesting, but my offline desktop showed me it was not offline. Don't ask me how…

Re the error, I've put some assert(strpos(),msg) calls in place as permanent unit tests, but to catch it in the wild I might rework the assert to write to a log file.

Are the pizza stones any good? Never tried them. I'm a Weber kettle fanboi, even cooked Xmas dinner on their largest kettle BBQ. Got me out of the house, with peace and quiet for most of the day… Bit of a whizz at BBQs; the trick is being a good firestarter…

vitesse · October 20, 2024, 10:14am

We are getting a bit off topic here - but hopefully we will be forgiven.

The stones work really well here but the trick is to get them up to 300+ C (approaching 600F) before you start and then do really quick change-overs so the temp doesn’t drop. I also use a Weber but mine is plumbed-in gas model (Genesis - https://www.heatgrill.com.au/product-category/barbeque/weber-premium-gas/weber-genesis-range-weber-premium-gas/) which I imagine makes it a lot easier to control temperature compared with charcoal grill models. But it might be worth researching - there seems to be lots of youtube videos giving enthusiastic instructions.

RchdR · October 20, 2024, 12:20pm

I like the smoke otherwise just cook in the kitchen. I’ll remember to preheat the stone.

Bruce · October 21, 2024, 12:28am

https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#StringPictureClass

pic   StringPicture
  code
  If pic.ParsePicture('@something') = st:notOk
    ! picture is bad
  end

This supports all Clarion pictures, and also supports StringTheory’s Extended Pictures:
https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#ExtendedPictures

Cheers
Bruce

RchdR · October 21, 2024, 1:51am

Lol. I see what you did there.

So you have your class method overhead, at least 4 lines there, and that's before the method code to test the picture, however you are doing it.

That could be done a couple of ways: one involving regexes, one using value stuffing (format, deformat, does the resultant value match?). If yes, it's a valid Clarion picture.

The latter could still be used in template code, but I’d have to use a DLL wrapper in the template, so a few more lines of code there. I see your extra date pictures which show week number, and as I'm aiming for the template builder to generate code in other languages, learning regexes to handle picture formats where they exist in other languages became the logical conclusion for me, for both short-term and long-term gain.

Even if I was using ST in a DLL wrapper to validate template prompts, I'd have to get you guys to build the pictures for other languages into ST. I can build a regex much more easily now, which strangely opens up new ways of coding, but I think the regex still wins at reducing the lines of code.

Bruce · October 21, 2024, 2:13am

I’m not suggesting you use StringTheory. You asked how to validate the picture using StringTheory, and I was answering that.

The StringPicture class breaks down the picture into the component parts. This then allows for picture manipulation, and reconstruction. The validation is just a side-effect.