Flat Serializer: A class for serializing and deserializing Clarion structures to CSV (or TSV)

Tags: #<Tag:0x00007fc0e220cf60> #<Tag:0x00007fc0e220c948> #<Tag:0x00007fc0e220c2e0>

A few months ago, I had to develop two features:

  • Export and import data to Excel files
  • Publish the TSV output of some existing Clarion processes to JSON/XML webservices

For both, I needed a straightforward way to convert between Clarion structures and Comma or Tab Separated Values.

I looked at the excellent CSVParseClass, but it didn’t have many of the features I needed, and for the parsing part I already had code I wrote years ago, so I decided to create a new class with a different approach (although I used and still use the CSVParserClass demo to create queue declarations from sample CSV files).

The class has been in production for months, but I finally had time to clean it, partially document it and share it. It’s available at GitHub.

To serialize a queue to a text file:

myQ QUEUE,PRE(myQ)
id    LONG
Name  STRING(30)
Date  DATE
Time  TIME
    END

fs FlatSerializer

  CODE
  ...
  fs.Init
  fs.SerializeQueueToTextFile(myQ,'testqueue.csv')

The resulting testqueue.csv text file will look like this:

   ID,NAME,DATE,TIME
   5,"Some Name",2021-07-30,18:45:56
   7,"Another Name",2021-12-16,08:12:34

To load the same text file to a queue:

  FREE(myQ)
  fs.Init
  fs.LoadTextFile('testqueue.csv')
  fs.DeSerializeToQueue(myQ)

For more details, please view the Readme file. Also, the file Tests.clw includes a few unit tests that can serve as examples of how to use all the methods.

The class is named Flat Serializer because CSV/TSV are a flat file formats, and because it flattens Clarion structures. For example, a group inside a group will be flattened like this:

MyGroup    GROUP
SomeString   STRING
FullName     GROUP
FirstName      STRING
LastName       STRING
             END
           END

SomeString,FullName,FirstName,LastName
Abc,Carlos      Gutierrez,Carlos,Gutierrez

You can use AddExcludedFieldByName() or AddExcludedFieldByReference() the exclude either Fullname or FirstName and LastName from the output.

The class uses the TUFO interface, published by Oleg Rudenko and Mike Duglas.

Feedback is welcome.

Edit Oct. 17, 2021

Following Federico Navarro’s lead I added a test using a sample file with 100k lines and 38 columns, and made some optimizations. These are the results:

Change Seconds
Base line time (first release) 43.6
String reference and slicing when loading file -6.3
Readonly mode and filebuffers -0.5
Precomputed LENs -10.2
DeformatColumnValue optimization -2.6
Pre-resolving field aliases -5.7
Final time 18.3

I also added a tiny local class fsDynString (inspired by the StringClass coded in SV’s libsrc\win\xmlclass.inc and TreeViewWrap.clw) to replace ANY as the unknown length string storage. It didn’t have any noticeable effect on performance in deserializing, probably it helps when serializing, but I didn’t benchmark that.

4 Likes

Good work Carlos. Very interesting. Thanks for sharing.

Some feedback:
-For large data, parhaps code could be optimized to improve processing time, with for example with precomputed LEN functions, and specifically LEN(pText) could be SIZE(pText) on LoadString
-BlankSeparators purpose and specific treatment for = could be avoided, as they would prevent biyectiveness => data = deserialize(serialize(data)). Also, you could send standard escaped quoted text in normal data and formulas, eg. 1,3.5,=SUM(A1:B1),"=SUBSTITUTE("“aa,;”","“a”","“b”",1)",col5
or
1;3,5;=SUMA(A1:B1);"=SUSTITUIR("“aa,;”";"“a”";"“b”";1)";col5
-Just to note TestAPP is using an internal copy of the class, and pass results to debug, no screen output.

1 Like

Hi Federico

Thanks a lot for the feedback.

I added precomputed LENs to LoadString as you suggested. I didn’t benchmark but, in this case, I think it also helps with the code readability.

Agreed, I think I’ll add a property to enable/disable this behavior.

Actually I code the class inside the testapp. The template SourceClass is very useful, it lets you edit the class inc and clw content at the same time using the embeditor.

If follows the “no news is good news” idea :slight_smile:. If everything works, the app just finishes. If one of the tests fails, it shows the result in a STOP() and copies it to the clipboard, and I usually have a Debugview++ window open to look up any details.

An alternative is to uncomment the lines:

  !StringToFile(TestsResult,'TestsResult.txt')
  !RUN('TestsResult.txt')

To display all the test results in your default text editor when the testapp finishes (there was a missing line to make this work, fixed).

I 've just pushed these changes to GitHub.

Regards, and thanks again for your ideas, I really appreciate it.

Hi Carlos,

I added a routine on TestApp to load a 28MB CSV that Jeff included in his project:

The code is:

TestCSV ROUTINE
DATA
fs FlatSerializer
TestResult ANY
!Queue definition by CSVReader:
MyQueue QUEUE,PRE(My)
id STRING(5)
first_name STRING(15)
last_name STRING(25)
email STRING(41)
ip_address STRING(15)
favorite_animal STRING(40)
favorite_color STRING(10)
favorite_guid STRING(36)
password STRING(12)
favorite_words STRING(584)
bitcoin_address STRING(34)
credit_card_type STRING(25)
currency STRING(13)
favorite_stock STRING(62)
balance STRING(6)
ticker STRING(14)
industry STRING(62)
favorite_plant STRING(40)
favorite_company STRING(41)
avatar STRING(82)
auto STRING(23)
buzzword STRING(24)
END
CODE
DebugView(TestsResult)
fs.Init
TestResult = FORMAT(TODAY(),@D10)&’ ‘&FORMAT(CLOCK(),@T04)&’ Init done’&RECORDS(MyQueue)
DebugView(TestResult)
fs.LoadTextFile(‘E:\test\CSVParseClass\SampleData\CSVDemo50K.Comma.CRLF.csv’)
TestResult = FORMAT(TODAY(),@D10)&’ ‘&FORMAT(CLOCK(),@T04)&’ LoadTextFile done’&RECORDS(MyQueue)
DebugView(TestResult)
fs.DeSerializeToQueue(MyQueue)
TestResult = FORMAT(TODAY(),@D10)&’ ‘&FORMAT(CLOCK(),@T04)&’ DeSerializeToQueue done. Records '&RECORDS(MyQueue)
DebugView(TestResult)
AssertEqual(50000,RECORDS(MyQueue),‘TestCSV: Records loaded’)

Originally it took about 1 minute to load, profiling some internal steps there were aproximately 20 seconds to load the CSV on a STRING (function FlatSerializer.StringFromTextFile), 30 seconds to process the STRING an build the pointers structure (FlatSerializer.LoadString) and 9 seconds to load the Queue (FlatSerializer.DeSerializeToQueue) on an intel i7.

Optimizing StringFromTextFile with String Slicing (and driver BUFFERS) it reduced from 20 to less than 1 sec. Both are techniques used on SystemStringClass.FromFile
Optimizing LoadString with precomputed LENs it reduced from 30 to 13 secs aprox.

About Excel formulas I thought about a flag, but found it is not necessary, default behavior should handle it right (it is “=…”, not ="…").

I saw testApp is coded like unit tests, it is interesting.

Thanks to you

Federico

1 Like

Hi Federico

I updated the class following your ideas, thanks a lot. More details in the Edit.

Yes, here’s an empty TestApp: https://github.com/CarlosGtrz/TestApp in case you want to play with it.