VitRegex - New Regular Expression (Regex) engine for Clarion

OK I have had a closer look and was interested in reading about “worst case quadratic time complexity”. I ran the initial example he gave using VitRegex:

? st.trace('---- recursion test 1 from https://iev.ee/blog/the-quadratic-problem-nobody-fixed/')
  st.setValue('bbbbbbbbbb')
  startTime = clock()
  matches = regex.matchAllRegex(st,'.*a|b')
  timeTaken = clock() - startTime
  if matches = 10
?   st.trace('10 matches found')
    passCount += 1
  else
    errCount += 1
    LogIt('recursion test 1 Failed: matches=' & matches & ' but expected 10' )
  end

but the time was barely measurable. Then I remembered he said

a 50-character input can take longer than the heat death of the universe . it’s exponential.

a quick search found that this heat death of the universe was

estimated to occur in roughly 10 ^ 100 years (a 1 followed by 100 zeros).

that sounded like a challenge. TBH I didn’t have that many years to wait around so hoped it was faster.

I remembered those old cartoons with a skeleton sitting in front of a VDU (remember those?) due to the sluggish response time.

I used the debug/tracing version of VitRegex (latest version 1.09) in order to see the backtracking and there were 2120 lines of tracing taking a total of just under 0.116 seconds.

vitRegex test tracing.txt (222.4 KB)

Hey the universe didn’t end!

What’s more, I am very clear in the manual to only use the debug/trace version when you are actually debugging some issue, because it is so slow (relatively speaking) - most of the time is spent outputting tracing information rather than getting on with the job at hand.

Anyway I re-did the 50 char torture test without the tracing:

1	0.000000	14400	TestRegex.exe	[st][1] ---- recursion test 2 from https://iev.ee/blog/the-quadratic-problem-nobody-fixed/
2	0.001495	14400	TestRegex.exe	[st][1] 50 matches found

and this time it took just under 0.0015 secs. Roughly 77 times faster. And that includes putting all 50 matches into the StringTheory results queue.

Anyway I was pretty (well, very) happy with that and it seems to indicate that all the time spent on coding optimisations in VitRegex is paying off (at least in this case - although the log did indicate it had fallen back to “Using standard exhaustive search” so that indicates even the basic exhaustive search without skips and jumps is fast).

Finally, I did notice that he mentions that RE# does not do either capture groups or lazy quantifiers. You can read the article to understand why they are not possible but suffice to say that VitRegex does both :grinning_face:

What colour? Orange, Green, Black and White?

Thats not particularly handy. So I was parsing the input string/file but I was also parsing the RegEx and expanding it, which I suspect you are doing as well to add additional functionality.

Still not looked at it. :roll_eyes:

Thing is if they are writing it themselves, surely they can build that in, I would have thought…

I think I worked on all of them back in the day but primarily green characters on a black screen.

no hurry - it is waiting there as another tool in your kit for when you need it.

Still, you might want to have a play with it some time.

Richard, Regex engines fall into two broad categories - back tracking and DFA (Deterministic Finite Automaton). Each has their advantages and disadvantages, and each has proponents, often with strongly held positions. You are probably familiar with FSM’s (Finite State Machines) so DFA’s are in somewhat the same direction. See below for some info from a google search.

VitRegex is backtracking, which makes it possible to have powerful facilities, and I have done a lot of optimisations to try to limit or mitigate the risks and make it as fast as possible (with two more optimisations still to do). Other regex engines that are DFA will often have a hard time with capture groups or backreferences or lazy quantifiers etc.

google search said:

The primary difference between a DFA regex engine and one that uses backtracking (typically NFA-based) is how they handle multiple potential matches: a DFA engine explores all possibilities in parallel and never backtracks, while a backtracking engine explores paths sequentially and “backtracks” to try a different path if one fails.

DFA (Deterministic Finite Automaton) Engines

Matching Process: DFA engines are text-directed. They read the input string character by character, maintaining all possible matching states simultaneously in a single pass. There is always a single, determined next state for each input symbol.
Performance: They are generally faster and guarantee linear time performance relative to the input string’s length, making them immune to Regular Expression Denial of Service (ReDoS) attacks.
Features & Limitations: DFA engines cannot support advanced features like backreferences (e.g., (a)\1), arbitrary lookarounds, or capturing groups because these features require memory to remember previous parts of the match, which conflicts with the stateless nature of a pure DFA.
Examples: Found in tools like awk, grep (GNU grep), lex, and Google’s RE2 library.
Backtracking (NFA-based) Engines
Matching Process: Backtracking engines are regex-directed. They try to match the pattern to the string, and if a path fails, they “backtrack” to a previously saved state to explore another alternative.
Performance: They typically run quickly for most common regexes, but can exhibit exponential time complexity in pathological cases with ambiguous or nested quantifiers, leading to potential “catastrophic backtracking” and ReDoS vulnerabilities.
Features & Benefits: The ability to backtrack enables support for powerful, non-regular features that developers often rely on, such as backreferences, capturing groups, and lazy quantifiers.
Examples: Common in programming languages like Perl, Python, .NET, Java, and PHP (PCRE library) DFA (Deterministic Finite Automaton) Engines

  • Matching Process: DFA engines are text-directed. They read the input string character by character, maintaining all possible matching states simultaneously in a single pass. There is always a single, determined next state for each input symbol.
  • Performance: They are generally faster and guarantee linear time performance relative to the input string’s length, making them immune to Regular Expression Denial of Service (ReDoS) attacks.
  • Features & Limitations: DFA engines cannot support advanced features like backreferences (e.g., (a)\1), arbitrary lookarounds, or capturing groups because these features require memory to remember previous parts of the match, which conflicts with the stateless nature of a pure DFA.
  • Examples: Found in tools like awk, grep (GNU grep), lex, and Google’s RE2 library.

Backtracking (NFA-based) Engines

  • Matching Process: Backtracking engines are regex-directed. They try to match the pattern to the string, and if a path fails, they “backtrack” to a previously saved state to explore another alternative.
  • Performance: They typically run quickly for most common regexes, but can exhibit exponential time complexity in pathological cases with ambiguous or nested quantifiers, leading to potential “catastrophic backtracking” and ReDoS vulnerabilities.
  • Features & Benefits: The ability to backtrack enables support for powerful, non-regular features that developers often rely on, such as backreferences, capturing groups, and lazy quantifiers.
  • Examples: Common in programming languages like Perl, Python, .NET, Java, and PHP (PCRE library)

I commend you on what you have done, and achieved.
I don’t use regex, mostly because I can’t bloody understand it :slight_smile:
It does, however, have its uses, and you have provided the community with what appears to be an excellent product.
Bravo and thank you!

2 Likes

I know what you mean, but when it clicks, you can write RegEx’s like you are writing normal words.

I know these are only basic examples.

RegEx inside a template #Validate

#Set(%StrPosResult,%StrPos(%Procedure,%RegEx,%TB2GlobalPropNoCaseSensitivity ))

Command line allowing \install /install & -install where install can be any case.

IF StrPos( GCL:CommandLineFlags, '^[\\|\/|-]install',True)

Template code to allow any space before PRE() with any amount space inside the parenthesis.

#Set( %PrefixPosStart, StrPos( Upper( %pVarDeclaration ), ',<32>*PRE(<32>*)' ) )

Clarion code looking for the word services.exe with no leading or trailing spaces in any case.

IF StrPos( GS:szExeFile, '^services.exe$', True)

If you need to use the curly brackets, just like you would use parenthesis in a maths formula, just remember the doubling up thats needed for the compiler, eg {{}.

Sometimes its also easier to put the regex in a string var and call that like in the first example or eg

strpos(Loc:SearchString,Loc:Reg,Loc:SetCase)

Browser check - note the backslash negating the special function of the characters, eg \\ and \. and this has to be last in the variable, I dont care about the preceeding path.

IF StrPos( GWLC:ProcessImageFilename, '\\iexplore\.exe$', True )
IF StrPos( GWLC:ProcessImageFilename, '\\firefox\.exe$', True )
IF StrPos( GWLC:ProcessImageFilename, '\\chrome\.exe$', True )
IF StrPos( GWLC:ProcessImageFilename, '\\msedge\.exe$', True )

Same as above but checking if the NT virtual DOS machine is in the command line.

StrPos( GWLC:CommandLine, '\\ntvdm\.exe$', True )

Here I’m checking for the existence of either WOWEXEC.EXE OR WINOLDAP.MOD and nothing more. Note the left doubled up curly brackets.

IF NOT StrPos( ppszModName,'{{\\SYSTEM32\\WOWEXEC.EXE|\\SYSTEM32\\WINOLDAP.MOD}$',True)

I definately use StrPos more than InString now, I cant find an example where InString is being used in my code.

In fact if you wanted to work out how good a programmer was, because StrPos can do everything InString does, if you saw instring being used and not StrPos that would suggest they cant use RegEx’s.

Its a useful (secret) metric for managers & recruiters for many programming languages, thats there in plain sight in public github repos.

thanks Sean! - it has been something in the back of my mind to explore for many years ever since I found shortcomings with Clarion’s built-in match() and strPos() way back last century.

yes that is a common problem as regex are often ‘dense’ and hard to decipher (and in some ways you are embedding another “mini-language” in your Clarion code) but look at the two references I gave earlier in this thread for regex101 and regExr which make things much easier to understand.

As Richard showed in his following message, once you get your head around regex they are really powerful. Of course there is the opposite opinion - Jamie Zawinski had the famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

see

and
https://www.reddit.com/r/xkcd/comments/1845fm/perl_problems/

In some ways it all comes down to familiarity, and with that, confidence. Once you start using regex you may well find more and more uses for them - but it requires a change of mindset. Remember Whorf’s hypothesis (aka linguistic relativity) which (simplified) suggests that one’s language shapes one’s thoughts - so it is with your available Clarion tools (whether built-in or third party). For example, we use Clarion queues without a second thought because we are so familiar with them - users of other languages might need to come up with other (perhaps less elegant) solutions.

7 posts were split to a new topic: Validating email addresses

Please try to keep this thread about VitRegex and relevant regex in general.

Version 1.10 of VitRegex is released:

There is a new dynamic array for longs included which replaced a queue that may be of interest even if you don’t use regex.

 - removed MatchesQueue from tryMatch (which is recursive and so a new
   queue was created at each level of recursion)
 - MatchesQueue was replaced by new Class VitDynaLong which is a dynamic
   array of longs which expands if and when needed.
 - this VitDynaLong class has simple set and get methods.
 - the new posArray (of type VitDynaLong) is at the VitRegex level of
   scope and so no new object is required for each call of tryMatch.
 - tryMatch altered to have an extra parameter startMatchIdx which is
   the number of entries in the VitDynaLong object *prior* to tryMatch
   being called.
 - further tryMatch and findPos debug tracing added (only visible in
   the debug/tracing version)
 - minor formatting changes in debugInfo

Version 1.11 of VitRegex is released:

 - lowered default maxRecursionDepth from 4096 to 3500.
 - on stress testing VitRegex I found that (for some currently unknown
   reason) there seems to be a *lower* stack limit in programs compiled in
   release mode compared to debug mode.  Note this is irrespective of the
   stack size set in the IDE under Project Options. (Hence I am reducing the
   default maxRecursionDepth to reduce likelihood of exhausting the stack
   when user compiles in release mode.)
 - remember you can alter the maxRecursionDepth in code if you think a
   higher or lower limit is required - see docs S4.1 and S6.9
1 Like

Version 1.12 of VitRegex is released:

 - changes in linkPatterns() to replace queue       (StackQ)
                                   with vitDynaLong (bracketPos)
 - minor formatting changes and fixes here and there throughout

I think this is getting close to being done now. (Unless a bug is found etc)

I am still investigating doing a Markdown version of the docs at some stage. And a request has been made to put it on GitHub so that may follow eventually.

Thanks to Carl for suggesting that in the first post in this thread, where there are release lines, we now include a link forward to the corresponding announcement.

1 Like

Version 1.13 of VitRegex is released:

 - fixed an "off by one" error in POSIX character classes inside bracket
   expressions that was introduced in vers 1.05 - we were advancing one
   character too few for the next loop iteration.

 - previously in analyseFirstCharSet I had a statement at the bottom
   "Non-alternating groups would require recursive analysis - skip for now"
   This was intended as something to come back to at a later date (and that
   later date is now). That's a roundabout way of saying I had forgotten all
   about it until now <g>.  The code has now been extended to cover this as
   previously a non-alternating group with something like say "(\d+)foo" or
   "([A-Z]+):" as the first token meant the first-character optimisation
   was not done but completely skipped. We now do this by looking at the
   first token inside the group.

 - found another similar comment in analyseFirstCharSet:
    ! Merge this token's first chars into merged set
    ! (Simplified: just handle literals and char classes for now)
   This meant that patterns like \d, \w, \R, etc were not optimised as the
   group-alternation only handled L (literals) and C (Character classes).
   The code has now been expanded to do a full optimisation.

 - and found yet another similar earlier comment again in analyseFirstCharSet:
    ! Groups are complex - could contain alternation, optional elements, etc.
    ! For now, skip optimisation if branch starts with a group
    ! (Could be enhanced later to recursively analyse the group)
   So again the time has come to flesh this out for a fuller optimisation.
   Note that this is in the top-level alternation branch-scanning loop
   (where scanForAlternation(1) is true), and up to now it aborted entirely
   whenever a top-level branch started with a group so things like
   "(cat|dog)|fish" or "(\d+)|abc" were not optimised - but they are now.

So one bug fix and three lots of optimisations to analyseFirstCharSet.

I should add that my test suite has now passed 900 tests. :tada:

2 Likes

Version 1.14 of VitRegex is released:

- in analyseFirstCharSet, the test for "of 81" ('Q') was lumped in with
  simple zero-width tokens that just need tokenPos += 1. But 'Q' assertions
  have a body that extends to its linkIndex. So just incrementing tokenPos by
  one walked INTO the assertion content, not past it, causing the first
  character of the assertion's body to be mistakenly used as the first char
  of the branch. Kudos to Claude for noticing this bug.  It was introduced
  in vers 1.04 and had been hiding in plain sight since then. (I've now
  added specific tests for this to guard against any future regression).

- in yesterday's Version 1.13 in the new "of 71 code",
  self.scanForAlternation(tokenPos + 1) was not bounded by (closingPos - 1)
  and could therefore potentially find an outer | and take the "alternating
  group" path for a non-alternating group like "(cat)" in "(cat)|dog".
- note this did *not* create a bug as the alternating path has
  getNextAlt(x, closingPos) properly bounded so it was ultimately handled
  correctly, if somewhat serendipidously rather than by design.  You gotta
  get lucky some times <g>. Still, I decided it was better to "do it right"
  so have added an extra parameter, pEndPos, to scanForAlternation.
  So:
      scanForAlternation    Procedure(long pStartPos), bool
  is now
      scanForAlternation    Procedure(long pStartPos, long pEndPos=0), bool

  so when pEndPos is zero (the default when not supplied), which is the case
  when currently called from analyseLiteralPrefix, analyseFirstCharSet (the
  non-"of 71" paths) and findPos, it will continue to scan to the end of
  the pattern as before.

Version 1.15 of VitRegex is released

I am hoping this is the last version for a while as I’ve exhausted all optimisation strategies that I can come up with for the time being, but it is April Fools Day, so who knows?

This version fixes two optimisation misses:

 - do not do analyseLiteralSuffix when in multiline mode as it is not used as
   $ matches line endings not just absolute end.  Computing suffix might risk
   spuriously clearing literalPrefix when prefix=suffix.  While this would not
   create a wrong answer it would remove an optimisation and slow things down.

 - fixed missed optimisation in analyseFirstCharSet, non-alternating group
   section - needed to add one so
      x = self.patterns.linkIndex + 1
   rather than
      x = self.patterns.linkIndex
   as the loop was fetching the assertion's closing ) token (type G), which
   is *not* in the consuming-token case list hence the correct first
   character set was never computed.  Note this was a performance miss, not
   a correctness bug - the engine still produced correct results, just
   without the position-skipping optimisation.