UUID v7 - Universally Unique Lexicographically Sortable Identifier (ULID)

Gustavo_Saracca · August 26, 2024, 2:35pm

This simple library is an approach to the ULID defined in RFC4122, allowing the generation of time-ordered UUIDs. The code is written in pure Clarion and has been tested in C10.

While implementations using (today() and cloc()) worked correctly, this is a better implementation as it uses the API call GetSystemTime(), which returns milliseconds, thus improving the ordering.

The example (testing code), 1 million ULIDs are generated in less than 7 second, which is very good for pure Clarion code.

NewUUIDv7Array need an 16-byte array.
NewUUIDv7 call NewUUIDv7Array an “flag” is used to designate a lower or upper case HEX like CWUtil.

module( 'ULID.CLW' )
    SetGMT( long _gmt )
    NewUUIDv7Array( *byte[] _ulid )
    NewUUIDv7( long flag=0 ),string
end !* module *

I hope you find it useful, and any recommendations or improvements are welcome.

ULID implementation in Clarion

Bruce · August 27, 2024, 8:18am

Did you really mean to release this under the GPL? That means Clarion folks can’t really use it because it would then make their code GPL as well, which I’m guessing most people don’t want.
If you changed the license to MIT then it would be ok to use.

That said, you do have a small flaw in your code. And it comes from a pretty vague spec (I made the same mistake in my first iteration of writing this as well.)

the spec (quoting from Wikipedia here) says;

Remaining 74 bits are random seeded counter (optional, at least 12 bits but no longer than 42 bits) and random.

What they’re trying to say here is that the 74 bits are split into 2 parts;
The first part is a sequential counter, which increments by random amounts.
The second part is pure random.
The first part should be at least 12 bits long, max 42 bits long) and the rest should be random.
(You have all 74 bits as random).

Frankly, for our use cases, the sequential part is overkill - but I thought I’d let you know since you were kind enough to post the code for everyone to use.

CarlBarnes · August 29, 2024, 7:38am

This Unix Time calculation example may interest you that uses a FILETIME type and GetSystemTimeAsFileTime().

Because that type is the number of 100th Nanoseconds since 1/1/1601 the math is simple subtract and divide. I would suggest you use the i64 library for all the math and not the LARGE_INTEGER. You would have to test that it is just as fast as your current method.

Int64 GetSystemTimeAsUnixTime()
{
   //Get the number of seconds since January 1, 1970 12:00am UTC
   //Code released into public domain; no attribution required.

   const Int64 UNIX_TIME_START = 0x019DB1DED53E8000; //January 1, 1970 (start of Unix epoch) in "ticks"
   const Int64 TICKS_PER_SECOND = 10000000; //a tick is 100ns

   FILETIME ft;
   GetSystemTimeAsFileTime(out ft); //returns ticks in UTC

   //Copy the low and high parts of FILETIME into a LARGE_INTEGER
   //This is so we can access the full 64-bits as an Int64 without causing an alignment fault
   LARGE_INTEGER li;
   li.LowPart  = ft.dwLowDateTime;
   li.HighPart = ft.dwHighDateTime;
 
   //Convert ticks since 1/1/1970 into seconds
   return (li.QuadPart - UNIX_TIME_START) / TICKS_PER_SECOND;
}

I would suggest you split out a separate function to Get Unix Time as useful on is own. I’d do that first with your current method, then create the new FILETIME way as a separate function. Compare the 2 results as within 1 second.

I would suggest you rename “flag” to make it obvious what it does and sort of self documenting. Also use the BOOL type to indicate True/False:

NewUUIDv7( BOOL LowerCaseFlag=0 ),string

Gustavo_Saracca · August 29, 2024, 3:11pm

Hi Bruce,

Thank you very much for your comments.

I changed the license to an MIT license, updated the algorithm to correct the GUID format, and optimized the algorithm and time cost by about 70%. I also decoupled the CWUtil routine with a faster BYTE to HEXA conversion version with two llokup tables. And finally, it is no longer necessary to link the ABC library, leaving the algorithm available for use by any implementation.

Current and fixed format:
“0191944D-8EC4-7A70-86F5-8AA506DEE33A”

At the moment the time per 1 million generation writing to a text file is around 4.5 seconds.

For the same algorithm in GO it is 2.25. I have finally noticed that writing the data to the example file in Clarion is extremely efficient compared to GO.

Gustavo_Saracca · August 29, 2024, 7:27pm

Hi Carl,

I’ve changed long to BOOL and I’m trying to generate a version with your suggestion but I’m having some problems with type conversions.

I appreciate the suggestion about the API call as I needed to find a solution to the precision issue. As the call is faster more GUIDs had the same TimeStamp. In my benchmark tests on the same test computer in GO, the implementation I based on, there were 3-5 identical TimeStamps when with my implementation I was getting 9.

CarlBarnes · August 30, 2024, 8:48am

I found the below Repo that was all about ULID with many implementations.

I’ve only looked at a few quickly. This C# .Net had a good Read Me and seemed interesting as a 26 byte string of letters that can be decoded. Maybe that’s not an official ULID.

Gustavo_Saracca · August 30, 2024, 10:49am

Hello Carl,

Two needs prompted me to create the library: one is the need to sort by ULID in a chronological manner and the other is what you point out regarding that implementation in C#. In that implementation, a decoding of the ULID is done, which allows other developers who exchange information with our systems to easily obtain the date and time at which the operations were performed.

I need to encode the ULID not only in a Clarion std String, I think it would be interesting to encode the 16-byte binary in BASE32 soon.

In summary:

It allows you to order the keys chronologically.
It allows others or yourself to decode the date and time of the operation.

vitesse · September 2, 2024, 5:50am

Hi Gustavo

I am not sure if you use StringTheory but I did Base32 encoding and decoding for ST which Bruce added in about a year ago. Might save you a bit of work.

https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#stBase32Decode
https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#stBase32Encode

(ST has lots of other goodies too… I am obviously biased [so take it with a pinch of salt] but I don’t code without it).

I should add that the standard default Base32 encoding uses the alphabet ‘ABCDEFGHIJKLMNOPQRSTUVWXYZ234567=’ where last char is pad char.

In order to retain the order of the keys I suggest using an alternative alphabet (which can be passed as a parameter), specifically “extended hex base32” which uses ‘0123456789ABCDEFGHIJKLMNOPQRSTUV=’ and has the advantage that the encoded and unencoded values have the same sort order.

Finally you could remove ‘U’ from the alphabet as a “profanity filter” hence:
‘0123456789ABCDEFGHIJKLMNOPQRSTVW=’ or if ‘V’ looks too much like ‘U’ then ‘0123456789ABCDEFGHIJKLMNOPQRSTWX=’. That way you are less likely to offend anyone.

Along the same lines, Crockford’s Base32 excludes the letters I, L, and O to avoid confusion with digits and excludes the letter U to “reduce the likelihood of accidental obscenity.”
Hence: ‘0123456789ABCDEFGHJKMNPQRSTVWXYZ=’

You just need to be sure that both the encoding and the decoding are using the same alphabet (32 characters + pad char).

Bruce · September 3, 2024, 7:03pm

Using the UUID to hold “data” is a terrible pattern. If uou want a timestamp in the record then add a timestamp column to the record.

Reading the tine out the UUID just because it contains the time is indicitive of junior programmers who lack experience. (Hint: the fact that this is version 7 of ways to generate a UUID should be telling you not to rely on the way it is generated…)

Gustavo_Saracca · September 3, 2024, 9:13pm

I fully agree with the first paragraph. Embedding data within a UUID is a terrible idea if it is intended to be used as the primary method of transmitting information.

To clarify, the intent is not to use the UUID as the primary means of transmitting information, particularly if this implies avoiding the inclusion of a timestamp. However, in many cases, the information is sent to third-party applications in various ways, and timestamps do not always accompany it. In such scenarios, using a UUID can be a valid alternative for determining when an event occurred or an update was made in a specific API, app, S3, etc.

Additionally, v7-based indices offer the benefit of reducing index/key fragmentation from the moment they are implemented, thereby increasing or improving database performance. In short, it is an alternative that may or may not be used, depending on the situation.

As a side note, .NET 9 (preview 7) includes this feature.

Unleashing the Power of UUID v7 in .NET 9

jslarve · September 4, 2024, 4:48am

Hi Gustavo - That link on your readme.md a 404. This link says that RFC4122 has been “obsoleted” RFC 9562 - Universally Unique IDentifiers (UUIDs)

Also, I have a question about line 126 of your ULID.clw. What is the purpose of that BSHIFT vs just putting the result? Are you implying something that I’m missing?

Thank you.