Over on the Clarion 11 newsgroup Steve asked:
Is there a quick way to get the file layout of a .csv file?
I want to do some data processing on the csv and a .tps file.
A couple of people quickly mentioned that you could just import the file into the dictionary, although when setting the column widths the import only checks the first 100 rows.
Wolfgang mentioned:
if you know the columns and if you know the delimiter, you could write
a tiny tool with StringTheory, which parses the .CSV in advance and then you can
estimate the required column widths from StringTheory. You could even make that tiny tool generate the table structure for you.
Takes some work to tinker such a tool. But who does not like to do some extra
work to save some extra work?
I responded that:
I had the same thought - but of course the irony is that if you have StringTheory then you don’t really need the table structure!
Of course, if you were then going to import the CSV into another format, like TPS or a SQL database, knowing the field lengths would be useful.
So with that in mind I wrote a function, CreateLayoutFromCSV, that does exactly that…
As usual I used my favourite Clarion tool, StringTheory (ST), which is ideal for any kind of slicing and dicing involving strings.
The Split method of ST makes parsing CSV very simple, as documented at:
https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#ParsingCSVFile
There was recently some talk on the CW-Talk Skype group about compliance with RFC 4180.
This is a “standard” for CSVs, and Wikipedia notes that RFC 4180 specifies:
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
- An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
- Each record should contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
- If double-quotes are used to enclose fields, then a double-quote in a field must be represented by two double-quote characters.
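As a quick way to see those quoting rules in action, here is a small Python sketch (Python shown only because its standard csv module implements RFC 4180-style quoting; the sample values are my own):

```python
import csv
import io

# Write a row containing an embedded double-quote and a comma.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(['3" nails', 'box of 100, loose'])
encoded = buf.getvalue().strip()
print(encoded)  # "3"" nails","box of 100, loose"

# Reading it back restores the original field values.
row = next(csv.reader(io.StringIO(encoded)))
print(row)  # ['3" nails', 'box of 100, loose']
```

Note how the field with the embedded quote is wrapped in double quotes and the quote itself is doubled, exactly as the last rule above requires.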
Anyway, the sample code linked above in the ST docs currently (version 3.38) has:
str   StringTheory
lne   StringTheory
x     Long
  code
  str.LoadFile('Somefile.CSV')
  str.Split('<13,10>','"')
  loop x = 1 to str.Records()
    Lne.SetValue(Str.GetLine(x))
    Lne.Split(',','"','"',true)
    field1 = Lne.GetLine(1)
    field2 = Lne.GetLine(2)
    ! and so on...
  end
This does NOT allow for the doubled-up quotes mentioned in RFC 4180.
So if you had a description field with [3" nails] or [12" vinyl LP] you are likely to see [3"" nails] and [12"" vinyl LP] instead.
To get around this I suggested adding a little more code to check for “” in fields and replace it with ":
str   StringTheory
lne   StringTheory
fld   StringTheory
x     Long
y     Long
  code
  str.LoadFile('Somefile.CSV')
  str.Split('<13,10>','"')
  loop x = 1 to str.Records()
    Lne.SetValue(Str.GetLine(x))
    Lne.Split(',','"','"',true)
    loop y = 1 to lne.records()
      fld.setValue(lne.GetLine(y))
      if fld.replace('""','"')
        lne.SetLine(y, fld.getValue())
      end
    end
    field1 = Lne.GetLine(1)
    field2 = Lne.GetLine(2)
    ! and so on...
  end
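If you ever want to sanity-check the results against an independent parser, the same unescaping happens automatically in Python's standard csv module (the sample data here is my own):

```python
import csv
import io

data = 'Code,Description\r\n1,"3"" nails"\r\n2,"12"" vinyl LP"\r\n'

# csv.reader copes with quoted commas and the doubled-up quotes
# from RFC 4180 in one pass, so no manual replace of "" with "
# is needed afterwards.
for row in csv.reader(io.StringIO(data)):
    print(row)
# ['Code', 'Description']
# ['1', '3" nails']
# ['2', '12" vinyl LP']
```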
Now we have that sorted out, we can base our processing of a CSV file to create a file layout roughly on that code.
I have created a procedure that takes the CSV file name as a parameter:
!==============================================
CreateLayoutFromCSV  PROCEDURE (String pFileName)
! written on 11th June 2021 by Geoff Robinson vitesse AT gmail DOT com
! (c) 2021 by Geoffrey C. Robinson - Released under the MIT License
str       StringTheory
lne       StringTheory
fld       StringTheory
errs      StringTheory
out       StringTheory
x         Long,auto
y         Long,auto
lines     Long,auto
cols      Long,auto
lengths   Queue
row1        long
max         long
          end
  CODE
  if not str.LoadFile(pFilename)
    fld.setValue('Error: Failed to load ' & clip(pFileName) & ' Error: ' & str.lastError)
    do ReportError
    return
  end
  str.Split('<13,10>','"')
  str.removeLines()                   ! get rid of any empty lines
  lines = str.records()
  if not lines
    fld.setValue('Error: No CSV lines loaded for ' & clip(pFileName))
    do ReportError
    return
  end
  Lne.SetValue(Str.GetLine(1))        ! get first line (possibly header)
  Lne.Split(',','"','"',true)
  cols = lne.records()
  clear(lengths)
  loop y = 1 to cols                  ! set row1 width of each column
    fld.setValue(lne.GetLine(y))
    fld.replace('""','"')
    lengths.row1 = fld.length()
    add(lengths)
  end
  loop x = 2 to lines                 ! check each row to get max width for each column
    Lne.SetValue(Str.GetLine(x))
    Lne.Split(',','"','"',true)
    if cols <> lne.records()
      fld.setValue('Inconsistent number of columns on line ' & x & |
        ': Expected ' & cols & ' but found ' & lne.records())
      do reportError
    end
    loop y = 1 to lne.records()       ! loop through columns
      fld.setValue(lne.GetLine(y))
      fld.replace('""','"')
      get(lengths,y)
      if errorcode()
        lengths.row1 = 0
        lengths.max = fld.length()
        add(lengths,y)
        fld.setValue('Adding column ' & y & ' to lengths queue on line ' & x)
        do reportError
      elsif fld._DataEnd > lengths.max
        lengths.max = fld._DataEnd    ! new maximum width for this column
        put(lengths)
      end
    end !loop cols
  end !loop lines
  ! now output format twice - first assuming line 1 is header and then assuming it is just data
  out.setValue(fld.CleanFileName(fld.FileNameOnly(pFileName,false)))
  out.replace(' ','')                 ! remove any spaces: same as out.remove(' ')
  if out.length() < 18 then out.setLength(18).
  out.append(' FILE,DRIVER(''BASIC''),NAME(''' & clip(pFileName) & '''),PRE(' & clip(upper(out.slice(1,3))) & ')<13,10>')
  out.append('record RECORD<13,10>')
  Lne.SetValue(Str.GetLine(1))        ! get first line (assume it is header)
  Lne.Split(',','"','"',true)
  cols = lne.records()
  loop y = 1 to cols
    fld.setValue(lne.GetLine(y))
    do CapitalizeFieldName
    get(lengths,y)
    do AddFieldLine
  end
  out.append(' END<13,10>')
  out.append(' END<13,10>')
  out.append('<13,10>!----------------------------------------------<13,10,13,10>')
  ! now output format assuming no header record (ie. first line is data)
  out.append(out.before('<13,10>') & '<13,10>')   ! copy first line
  out.append('record RECORD<13,10>')
  loop y = 1 to cols
    get(lengths,y)
    if lengths.row1 > lengths.max then lengths.max = lengths.row1.
    fld.setValue('Label' & y)
    do AddFieldLine
  end
  out.append(' END<13,10>')
  out.append(' END<13,10>')
  out.append('<13,10>!----------------------------------------------<13,10>')
  if errs.length()                    ! do we have any errors logged?
    out.append('<13,10>Errors:<13,10,13,10>')
    out.append(errs)
    out.append('<13,10>')
  end
  fld.setValue(pFilename)
  fld.setValue(fld.pathOnly() & '\' & fld.fileNameOnly(,false) & |
    ' layout on ' & clip(left(format(today(),@d12))) & |
    ' at ' & clip(left(format(clock(),@T05))) & format(clock()%100,@N02) & '.txt')
  out.saveFile(fld.getvalue())
  ! stop('saved layout for ' & clip(pFileName) & ' to ' & fld.getValue())
AddFieldLine ROUTINE
  if fld.length() < 22 then fld.setLength(22).
  if lengths.max
    fld.append(' STRING(' & lengths.max & ')<13,10>')
  else
    fld.append(' STRING(1) ! note: no data found<13,10>')
  end
  out.append(fld)

ReportError ROUTINE                   ! note: error is in fld
  errs.append(fld.getValue() & '<13,10>')
  fld.trace()
  stop(fld.getValue() & '<13,10>(Error creating layout from ' & clip(pFileName) & ')')
CapitalizeFieldName ROUTINE           ! note: field name is in fld
  data
containsLower  Long,auto
containsUpper  Long,auto
i              Long,auto
  code
  fld.KeepChars('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_: ',' ')
  fld.Split(' ')
  loop i = 1 to fld.records()
    fld.setValue(fld.getLine(i))
    if fld.length() < 1 then cycle.
    containsLower = fld.containsA('abcdefghijklmnopqrstuvwxyz',fld.slice(2))   ! exclude 1st char which will be UPPERed
    containsUpper = fld.containsA('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    if containsUpper
      if containsLower
        if ~inrange(val(fld.valuePtr[1]),val('a'),val('z'))   ! first char not lower?
          cycle                       ! do NOT change this word in field name label - take it as the mixed case given
        end
      else
        fld.lower()
      end
    end
    fld.valuePtr[1] = upper(fld.valuePtr[1])
    fld.setLine(i,fld.getValue())     ! update value in queue of words
  end
  fld.join('')                        ! join the words back together to give the field name label
  if fld.length() < 1
    fld.setValue('_')                 ! blank field name
  else
    case val(fld.valuePtr[1])         ! first char should be alpha or underscore
    of val('A') to val('Z')
    orof val('_')
      ! all ok
    else
      fld.prepend('_')
    end
  end
Running this on a file with a header line:
"ID","Invoice","Inv Date","Amount","G.S.T","INV TOTAL","banked date","Client","transType"
gave the following:
test1 FILE,DRIVER('BASIC'),NAME('.\testdata\test1.csv'),PRE(TES)
record RECORD
Id STRING(3)
Invoice STRING(8)
InvDate STRING(10)
Amount STRING(7)
GST STRING(4)
InvTotal STRING(7)
BankedDate STRING(10)
Client STRING(26)
TransType STRING(13)
END
END
!----------------------------------------------
test1 FILE,DRIVER('BASIC'),NAME('.\testdata\test1.csv'),PRE(TES)
record RECORD
Label1 STRING(3)
Label2 STRING(8)
Label3 STRING(10)
Label4 STRING(7)
Label5 STRING(5)
Label6 STRING(9)
Label7 STRING(11)
Label8 STRING(26)
Label9 STRING(13)
END
END
!----------------------------------------------
As you can see, it creates TWO definitions: the first assumes the first line contains the field names, and the second assumes there is no header line.
Note that the string lengths can differ between the two layouts where a column name in the header is wider than the widest data value in that column.
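If you want to experiment with the width-scanning idea outside Clarion, here is a rough Python sketch of the core logic (the function name and shape are my own, not a port of the Clarion code): track the maximum data width per column, keeping the header row separate so that both layouts can be produced:

```python
import csv

def column_widths(path):
    """Scan a CSV file and return (header_row, data_widths), where
    data_widths[i] is the widest value seen in column i across all
    rows after the first (possibly-header) row."""
    with open(path, newline="") as f:
        rows = [r for r in csv.reader(f) if r]   # skip blank lines
    if not rows:
        return [], []
    header = rows[0]
    widths = [0] * len(header)
    for row in rows[1:]:
        for i, value in enumerate(row[:len(widths)]):
            widths[i] = max(widths[i], len(value))
    return header, widths
```

For the second, no-header layout you would also fold the first row's own lengths into the maxima, which is what the Clarion code does with lengths.row1.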
There is some very rudimentary error handling in the ReportError ROUTINE. You may wish to expand this or get rid of the STOP() and so on. Errors are logged and appended to the end of the layout file.
The layout file takes the name of your data file with "layout on <date> at <time>"
appended, so each time you run it you get a new time-stamped file.
The CapitalizeFieldName ROUTINE has some smarts in it to try to preserve word boundaries by capitalizing first letters. Look, for example, at the output above for:
"ID" -> Id
"Inv Date" -> InvDate
"G.S.T" -> GST
"INV TOTAL" -> InvTotal
"banked date" -> BankedDate
"transType" -> TransType
You could easily change this if you prefer to use underscores between words, and so on.
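For the curious, here is an approximate Python sketch of that heuristic (my own approximation, not a line-for-line port of the ROUTINE):

```python
import re

def capitalize_field_name(raw):
    """Approximate the CapitalizeFieldName heuristic:
    - non-identifier characters become word breaks
    - ALL-CAPS words are title-cased (INV -> Inv)
    - mixedCase words get their first letter raised (transType -> TransType)
    - words already starting with a capital are kept as given (InvDate)
    - the result must start with a letter or underscore."""
    words = []
    for word in re.split(r"[^A-Za-z0-9_]+", raw):
        if not word:
            continue
        has_lower = any(c.islower() for c in word[1:])  # ignore 1st char, it gets uppered
        has_upper = any(c.isupper() for c in word)
        if has_upper and has_lower and not word[0].islower():
            words.append(word)           # mixed case starting upper: keep as given
            continue
        if has_upper and not has_lower:
            word = word.lower()          # ALL-CAPS word, e.g. "INV" -> "inv"
        words.append(word[0].upper() + word[1:])
    label = "".join(words)
    if not label:
        return "_"                       # blank field name
    if not (label[0].isalpha() or label[0] == "_"):
        label = "_" + label              # labels must not start with a digit
    return label
```

Running it over the header fields above gives the same labels as the Clarion version, e.g. capitalize_field_name('G.S.T') returns 'GST' and capitalize_field_name('banked date') returns 'BankedDate'.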
Anyway, I hope this proves useful. Please have a play and report any errors/bugs to me and I will have a look.