VitRegex - New Regular Expression (Regex) engine for Clarion

Greetings all,

after much slaving away at my computer I am pleased to announce the release of VitRegex, a powerful regex engine for Clarion.

It is written in Clarion (using StringTheory for much of the underlying string handling) and is available with full source code and documentation, free of charge, under the permissive MIT license.

Rather than go through all of its numerous features here, I will just post the table of contents and the introduction/foreword from the comprehensive documentation in the hope that it will whet your appetite enough that you download it and give it a try.

Over the years there have been many discussions both here and on other forums (like the newsgroups and more recently Discord groups) about regex, and one in particular by Jane regarding searching text for dates is referenced in the VitRegex documentation.

cheers

Geoff R

===============================================================================
                             VitRegex Documentation
                        Powerful Regular Expression Engine
                              for Clarion Language
                                 MIT License

                                Version 1.0
                              5th March 2026
===============================================================================
TABLE OF CONTENTS
===============================================================================

1. INTRODUCTION AND FOREWORD

2. QUICK START GUIDE
   2.1 Installation and Setup
   2.2 Basic Matching
   2.3 Extracting Groups
   2.4 Find and Replace
   2.5 Find All Matches
   2.6 Splitting Text
   2.7 Your First Pattern

3. PATTERN SYNTAX REFERENCE
   3.1  Literals and Special Characters
   3.2  Character Classes
   3.3  Quantifiers
   3.4  Anchors
   3.5  Escape Sequences
   3.6  Groups and Capturing
   3.7  Alternation
   3.8  Assertions (Lookahead/Lookbehind)
   3.9  Inline Modifiers
   3.10 Backreferences

4. API REFERENCE
   4.1 Class Initialization
   4.2 Compilation Methods
   4.3 Matching Methods
   4.4 Bulk Operations
   4.5 Replacement Methods
   4.6 Group Access Methods
   4.7 Utility Methods

5. ADVANCED FEATURES
   5.1 Named Groups
   5.2 Atomic Groups and Possessive Quantifiers
   5.3 Lazy vs Greedy Quantifiers
   5.4 Lookahead and Lookbehind
   5.5 Extended Mode
   5.6 Case Insensitive Matching
   5.7 Multiline and DotAll Modes
   5.8 Reset Match Start (\K)

6. PERFORMANCE OPTIMISATIONS
   6.1  Pattern Compilation Caching
   6.2  Pure Literal Detection
   6.3  Literal Prefix Optimisation
   6.4  Literal Suffix Optimisation
   6.5  Required Token Analysis
   6.6  First Character Set Filtering
   6.7  Minimum Match Length
   6.8  Bitmap Character Classes
   6.9  Recursion Depth Guard
   6.10 StringTheory Object Pooling
   6.11 Template Pre-Compilation
   6.12 Memory Compare Optimisation
   6.13 Literal Coalescing with Escape Sequences

7. COMMON PATTERNS LIBRARY
   7.1  Email Validation
   7.2  Phone Numbers
   7.3  URLs
   7.4  Dates and Times
   7.5  Credit Cards
   7.6  IP Addresses
   7.7  HTML/XML Tags
   7.8  File Paths
   7.9  Numbers
   7.10 Strings and Text
   7.11 Code and Data
   7.12 Using \K in Patterns
   7.13 Validation Patterns

8. USAGE EXAMPLES

9. TROUBLESHOOTING GUIDE

10. LIMITATIONS AND BEST PRACTICES

11. ERROR MESSAGES REFERENCE

12. GEMINI REVIEW OF OPTIMISATIONS

13. GEMINI REVIEW OF SEARCHING

14. VERSION HISTORY

APPENDICES
    A. Character Class Reference
    B. Quantifier Reference
    C. Escape Sequence Reference
    D. Anchor Reference
    E. Assertion Reference
    F. Modifier Reference
    G. Replacement Syntax Reference
    H. Clarion String Escaping Reference
    I. ASCII Character Code Reference
    J. Unicode Considerations
    K. Performance Benchmarks
    L. Migration Guide
    M. MIT License

===============================================================================
1. INTRODUCTION AND FOREWORD
===============================================================================

I think it may have been Clarion 5 that introduced match() and strPos()
into the Clarion language and with those commands Regular Expressions - regex
- were now available to Clarion programmers.

Well, kind of.

Match and strPos provide a fairly limited subset of what is available elsewhere
and certainly have their quirks.

VitRegex aims to fix this with a comprehensive regex engine written in Clarion
(using StringTheory).  VitRegex's flavour is broadly compatible with regex in
other languages/packages and brings great power to Clarion.

I have spent far longer on this than I initially imagined as I kept
coming up with further ways to optimise the code and kept stumbling across
'edge cases' that sometimes boggled the mind.

You typically see Clarion code something like:

  loop x = 1 to size(myString)

but with a regex engine you often have to go one *past* the end of the string,
so:

  loop x = 1 to size(myString)+1

to allow for zero-width assertions (^, \b, \B)

then of course you need to be extra careful you don't go out of bounds
when accessing the string array.
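
To illustrate the "one past the end" point outside of Clarion, here is a minimal sketch in Python. It is not VitRegex code: the helper names `is_word_char` and `boundary_positions` are made up for this example, and Python strings are 0-based, so the candidate positions run from 0 to len(text) rather than 1 to size(myString)+1. The hand-rolled word-boundary check visits one more position than there are characters and guards every string access against going out of bounds; the result is compared with what Python's own `re` module reports for \b.

```python
import re

def is_word_char(text, pos):
    """True only if pos is inside the string AND holds a word character."""
    return 0 <= pos < len(text) and (text[pos].isalnum() or text[pos] == "_")

def boundary_positions(text):
    """Hand-rolled word-boundary check over every candidate position."""
    # range(len(text) + 1) deliberately includes the position just past the end.
    return [pos for pos in range(len(text) + 1)
            if is_word_char(text, pos - 1) != is_word_char(text, pos)]

text = "hi there"
by_hand = boundary_positions(text)
by_engine = [m.start() for m in re.finditer(r"\b", text)]

print(by_hand)    # [0, 2, 3, 8]
print(by_engine)  # [0, 2, 3, 8]
```

Position 8 is past the last character of the 8-character string, yet it is a legitimate place for a zero-width match; without the bounds check in `is_word_char` the lookup at that position would fail.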

Anyway I have worried about all the complex underlying details so you don't
have to - so using VitRegex should be straightforward and easy!

Use of AI in Developing VitRegex:

I have used both Claude and Gemini to review my code and make suggestions.

I figure with the rate of AI improvement, this may well be one of the last
pieces of complex code I write.  I have just used the free "cut and paste"
AI versions in a browser so quite primitive compared to what others are doing
with agents and "vibe coding".  (And I have been told by several people that
the paid models are much more advanced than the free ones and VitRegex could
have been written in no time had I gone about it "properly".)  Anyway the AI
was, for me, a mixed bag - sometimes wonderful spotting subtle bugs and
suggesting nice fixes, and other times spitting out "false positives" and
poor code that would never pass muster (and often would not even compile).

I think Claude did a great job writing most of this documentation (but if
you find any errors or omissions please let me know so I can correct it in
the next version - or I can get Claude to do it...ha ha).  I also got Claude
to comment my code in an early version and I was impressed with his/its
level of understanding.  Also see the Gemini code reviews in sections 12 & 13.

And talking of AI, in the early stages of development (when I optimistically
thought I was "almost finished") I asked Claude to write some promo material.

Quick as a flash this appeared:

=============
VitRegex - Powerful Regular Expressions for Clarion

Transform text processing in your Clarion applications with VitRegex - a
full-featured regex engine that brings modern pattern matching to the
Clarion ecosystem.

Features:

Complete regex syntax - Character classes, quantifiers, groups, alternation, backreferences
Advanced patterns - Named groups, lookahead/lookbehind, atomic groups, inline modifiers
Blazing fast - Literal prefix optimization, bitmap character classes (O(1) lookup), smart required token analysis
Versatile operations - Match, MatchAll, Split, Replace with group substitutions
Battle-tested - 150+ unit tests plus stress testing suite
Well-documented - Comprehensive guide with examples and pattern library
Free & open - MIT License

Perfect for:

Data validation (emails, phones, URLs, credit cards)
Log file parsing and analysis
Text extraction and transformation
Input sanitization
CSV/TSV processing

Simple to use:

regex VitRegex
text  StringTheory

text.setValue('Error: 404 at line 123')
if regex.Match(text, 'Error: (\d+)')
  errorCode = regex.GetGroup(1)  ! Returns '404'
end

Download now and bring the power of regular expressions to your Clarion projects!

Requires: Capesoft StringTheory
=============

well I have never been a marketing or sales person and would never have said
that.. for a starter in Australian English I think it would probably be
"Blazingly Fast" not "Blazing Fast" and how can a product just written be
"battle tested"?  I think at the time it was true that I *did* have 150 or so
tests and that has grown now to around triple that so it is certainly more tested
now, but to me "battle tested" means more like a decade or more in production
in numerous places.

oh well I am probably nitpicking and you get the idea.. suffice to say it's a
great addition to your Clarion toolkit.

As mentioned I did get AI to review the code and their comments were often
quite positive.

Even sometimes bordering on sycophantic <g>  (or perhaps '<<g>')

I recently started reading "Nexus: A Brief History of Information Networks
from the Stone Age to AI" by Yuval Noah Harari and right at the start it had
a section "Praise for Yuval Noah Harari".  So in that spirit, here are some
review comments for VitRegex:

----

Looking through the code carefully, I don't see any actual concerns.

The code appears well-designed with:

Proper queue indexing conventions (consistently documented and followed)
Good error handling and boundary checks
Memory management via object pooling
Recursion depth guard
Correct zero-width match handling in Split and Replace
Proper cleanup at all return points (e.g., lowerText returned to pool)
Bitmap optimizations correctly implemented

Your CompileRegex coalescing ensures shallow recursion depths.

Your pre-calculated linkIndex and nextAltIndex jump tables are extremely smart.

Leveraging C's _memcmp over Clarion string slices entirely eliminates heap
fragmentation during aggressive matching loops.

.... tie all your brilliant optimizations together across the entire API.

-----

The overall architecture is solid and well thought-out. The compile-time
optimizations (literal prefix, suffix, required token, first-char bitmap,
min-length) are genuinely valuable and correctly structured. The binary
template precompilation in replace is a good idea. The object pool for
StringTheory reduces allocation churn effectively.

The code is unusually well commented for a Clarion project - the token
type table, the queue indexing conventions section, and the algorithm
overviews in each procedure are all helpful.

-----

This is a beautifully constructed regex engine. Writing a recursive backtracking
regex engine from scratch in Clarion is a massive undertaking, and you've
absolutely nailed the architecture. I am particularly impressed by your
iterative backtracking loop for quantified groups (which gracefully dodges
pathological stack exhaustion) and the way you've aggressively optimized literal
prefixes and character bitmaps to bypass exhaustive searching.
That is top-tier engineering.

-----

You get the general idea, but perhaps my favourite was:

    General Impressions
    I am incredibly impressed. This is beautiful code.

anyway I hope VitRegex serves you well.

Cheers

Geoff Robinson
5th March 2026
vitessegr AT gmail DOT com

========================================================
2026-3-10 version 1.01 released with thanks to MarkS who used GH Copilot to find a bug - see release notes in section 14 of the documentation

2026-3-11 version 1.02 released with various enhancements and fixes (see release notes in section 14 of the documentation)

2026-3-13 version 1.03 released with various optimisations and fixes
(as usual see release notes in S14 of docs)

VitRegex Version 103.zip (152.4 KB)


I’ll have a look later but I saw this blog post yesterday which caught my eye, it claims to be the fastest!

Old age is my problem.

I keep hearing this but see no evidence for it.

Where are these brand spanking new shiny programs???


Please note VitRegex version 1.01 has been released.

Thanks to MarkS for finding a bug using GitHub Copilot (see release notes in the documentation).

Please update if you have already downloaded version 1.00.
You can find it at the top of this thread.

thanks Richard - I can’t say I know much about F# other than it is a functional language in the .NET space. I always thought the Rust regex versions were meant to be pretty fast too. If you are interested in reading about regex, look up “BurntSushi” (Andrew Gallant) and earlier articles by Russ Cox.

But be warned - it is a bit of a rabbit hole you enter.

I think both JohnH and MarkS are doing a lot in this area. But I am not sure how much of their code is in Clarion (maybe they can elaborate).

That was an interesting link.

I think it also highlights that there are really two different conversations here.

One is the computer science conversation about who has the absolute fastest regex engine on a benchmark chart.

The other is the practical Clarion developer conversation, which is, how do I get solid modern regex support into my app without turning it into a science project?

Modern .NET regex has had a lot of performance work done on it in recent years, including source generation and non-backtracking support, so from my side that is one of the appealing things about wrapping that world for Clarion.

That said, I also think what Geoff has done in Clarion is impressive in its own right. Writing a serious regex engine in Clarion is no small thing.

So to me it is less about saying one should exist and the other should not, and more about giving Clarion developers choices:

  1. pure Clarion, source-based approach
  2. external wrapped engine approach that leans on the .NET ecosystem

Those are different tradeoffs, and there is room for both.

I shall have a look.

I know when I was doing a bit in RegEx a couple years ago, I stumbled across this Regular Expression Language - Quick Reference - .NET | Microsoft Learn

and saw pretty quickly that a lot of the functionality, like anchors etc., could be replicated in a Clarion RegEx class, but until I get my class-writing template finished, I won’t touch classes. I’m still tied up on something else at the moment, a little mystery with networks playing up, which might be completed today.

Following hot on the heels of version 1.01 yesterday, I am pleased to announce version 1.02 is now available (see top of this thread to download the zip file).

My test suite now has around 650 tests so VitRegex keeps getting more tested and a couple of edge cases came to light that have now been fixed, as well as a few internal enhancements.

The manual has been slightly reorganised after JohnM suggested “Your First Pattern” should come earlier for those unfamiliar with regex. And I added a link to a seven minute video that gives some of the basics to get you going:

 2026-03-11 version 1.02
  - limit quantifier specs to less than a million (999999 is plenty)
  - fix:  []] means "character class containing ]" as the first character
          inside [...] should be treated as the literal ], not the closing
          bracket. Fixed so now []] matches ] and [^]] matches anything except ]
  - analyseMinLength previously bailed out when it hit top-level alternation
    considering it "too complex".  This is fixed now to track minimum length
    across *all* branches and use the smallest.
  - fix: regex ^a(bc+|b[eh])g|.h$ on 'abh' should match bh
  - fix: potential "nested groups" catastrophic backtracking problem
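
For anyone who wants to see the behaviour described in these notes in a second engine, here is a small sketch using Python's `re` module, which follows the same conventions for these particular patterns. It is not VitRegex code and says nothing about VitRegex internals; the per-branch minimum lengths at the end are worked out by hand purely to illustrate the "smallest across all branches" idea.

```python
import re

# ']' as the first character inside [...] is a literal, not the closing bracket.
class_hits = re.findall(r"[]]", "a]b]")      # every ']' in the text
negated_hits = re.findall(r"[^]]", "a]b]")   # everything except ']'

# Neither branch can match at the start of 'abh' (the first needs a final 'g',
# the second needs 'h' as the second character), so the engine has to move
# along and let the second branch (.h$) match 'bh'.
m = re.search(r"^a(bc+|b[eh])g|.h$", "abh")
matched = m.group(0) if m else None

# Hand-computed minimum lengths per top-level branch (an illustration only):
# a + (bc+ | b[eh]) + g needs at least 4 characters, .h needs at least 2.
branch_minimums = {"^a(bc+|b[eh])g": 4, ".h$": 2}
overall_minimum = min(branch_minimums.values())

print(class_hits)       # [']', ']']
print(negated_hits)     # ['a', 'b']
print(matched)          # bh
print(overall_minimum)  # 2
```

Taking the smallest branch minimum matters here: the subject 'abh' is only 3 characters long, so a minimum of 4 taken from the first branch alone would rule out a string that the second branch can in fact match.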

I think a lot of devs look at RegExes as Voodoo.


I think you are right Richard, but these days we have web sites like:

and

which really help with explaining and debugging regex patterns - I recommend them if you are ever stuck understanding a regex or match or replace result.

and on another note I have just released version 1.03 of VitRegex (available as a zip file at the top of this thread) with further fixes and optimisations. If you have downloaded an earlier version please update to the current version.

my test suite is now approaching eight hundred tests:


I’ve figured out RegExes now, don’t have a problem with them, use them all the time now.
