TryInsert takes longer with more records

Tags: #<Tag:0x00007fc0d2fd4ab8>

graph

I guess it stands to reason that a TryInsert command would require more time as the number of records increases, but I only really noticed it today.

I am inserting 77,000 records and wanted to write some code to estimate the time to completion. The estimate kept going up, so I started recording the time taken divided by the number of records. The graph shows the time taken in seconds on the Y axis, and the number of records in the table on the X axis.

Actually it’s a cascading set of TryInsert, because the row being inserted into the main table is related to 5 other tables, which I insert stuff into before I do the final insert. I was supplied with a flat CSV file with all the data in it.

Comments anyone?

What driver?
Is it something where you could get exclusive access?
Are you using transactions (logout/commit)?

Donn, if you just insert the first 20,000 and stop. Then restart your program and insert the next 10,000 does the time per insert start at the higher time?

How many indexes did you end up putting on your tables?

Here is the relationship diagram in simplified form:

A3 Levels

All the data is supplied in a 36MB CSV file, just for a3_prod. (products). I can infer the data in a3_advertiser (for example) because the PK field is included, as well as the description. Similarly with each of the other tables. Here is the structure of the a3_prod table:

Every table has its PK, description, and the last 3 fields shown here.

So I have a primary key on brand3_id_original (LONG) and an index on product_name (CSTRING 200)
Then I have an index for the foreign keys: section_code_original, category_code_original, category3_code_original, advertiser_code_original all LONG, being the primary keys of a3_1level, a3_2level, a3_3level and a3_advertiser respectively.

a3_advertiser has a PK of a3_advertiser_original (LONG), and an index of advertiser_name (CSTRING 200)

a3_1Level has a PK of section_code_original (LONG) and an index on section_name (CSTRING 200).

a3_2Level has a PK of category_code_original, (LONG) and an index on category_name (CSTRING 200). In addition it also has section_code_original with and index, and section_name, because it is a subsection of a3_1level. I know that section_name is redundant, since there is a relationship between a3_1level and a3_2level, but I have kept it there for convenience anyway.

a3_3level has a PK of category3_code_original, (LONG) and an index on eng_category3_name (CSTRING 200). In addition it also has section_code_original with and index, and category_code_original with an index, plus the section_name and category_name fields. a3_3level is a subsection of a3_2level.

So there are 5 tables and several relationships all going on at the same time. I originally wrote and tested the code using TPS files, and then later used FM3 to convert the structure to the ODBC linked tables in PostgreSQL (32bit UNICODE ODBC driver).

The full module is here:
importa3.clw (11.1 KB)

But for simplicity I will highlight the key parts:

UD.Debug('ImportA3 - loading CSV file')
if ~stfile.LoadFile('C:\dev\nielsen\Ariana3.csv')   ! A3.csv ! Read the contents of the CSV file
    MESSAGE('Error opening ariana3.csv')
else ! A3.csv
    utstart = fmt.ClarionToUnixDate(today(),clock())    ! Start time
    stETA = 'busy ...'     
    UD.Debug('ImportA3 - deleting old data')  
    a3_prod{PROP:SQL}='DELETE FROM a3_prod'         ! Clean out the old data
    a3_advertiser{PROP:SQL}='DELETE FROM a3_advertiser' ! Clean out the related a3_advertiser data  
    a3_3level{PROP:SQL}='DELETE FROM a3_3level'     ! Clean out the related a3_3level data 
    a3_2level{PROP:SQL}='DELETE FROM a3_2level'     ! Clean out the related a3_2level data 
    a3_1level{PROP:SQL}='DELETE FROM a3_1level'     ! Clean out the related a3_1level data 
    OPEN(Progress)                                  ! Open the progress window
    UD.Debug('ImportA3 - splitting file')
    stfile.Split('<13><10>')                        ! Records end with cr lf
    ncount = stfile.records()                       ! Count the total number of records  
    UD.Debug('ImportA3 - ' & ncount & ' records')        
    i = 0
    j = 0
    LOOP    ! Loop
        i = i + 1                                   ! Current line number
        stline.SetValue (stfile.GetLine(i))         ! Extract the line
        count = ncount-i                            
        DISPLAY(?Count)                             ! Update the countdown display
        stline.split(',','"','"',true)              ! Get all the CSV fields from the line
        ! Load the values into the relevant ariana3 fields
        A3:section_code_original = stline.GetLine(1)                ! LONG !Section Code Original
        A3:section_name        = stline.GetLine(2)                  ! STRING(200) !Section Name

This carries on for a bit while all the fields are populated. Then we do the 5 TryInsert commands

if A3:brand3_id_original > 0                ! Do we have a value?
    if Access:a3_1level.TryInsert() = Level:Benign ! Try to insert the a3_1level record
                    !UD.Debug('ImportA3 - AL1 ' & CLIP(AL1:section_name))
    end
    if Access:a3_2level.TryInsert() = Level:Benign ! Try to insert the a3_2level record
                    !UD.Debug('ImportA3 -  AL2 ' & CLIP(AL2:category_name))
    end
    if Access:a3_3level.TryInsert() = Level:Benign ! Try to insert the mp_3level record
                    !UD.Debug('ImportA3 -   AL3 ' & CLIP(AL3:sub_category_name))
    end                    
    if Access:a3_advertiser.TryInsert() = Level:Benign ! Try to insert the a3_advertiser record
                    !UD.Debug('ImportA3 + A3A ' & CLIP(A3A:advertiser_name))
    end
                !UD.Debug('ImportA3 - A3 ' & CLIP(A3:product4_name))
    if Access:a3_prod.TryInsert()  = Level:Benign ! Try to insert the record
        j = j + 1                           ! Success
    else
                    !UD.Debug('ImportA3 insert failed: ' & CLIP(stline.GetValue()))
        UD.Debug('ImportA3 a3_prod insert failed: ' & A3:section_code_original & ',' & CLIP(A3:product_name))
    end
else
    UD.Debug('ImportA3 bad data: ' & CLIP(stline.GetValue()))
end

Then the loop is closed:

        IF j % 1000 = 0                             ! Every 1000 records
            ut = utelapsed / (i * 1.0)              ! seconds per record
            UD.Debug('ImportA3 - ' & i & ' records, ' & ut & ' seconds/record')
        END                                         ! Every 1000 records                 
    UNTIL i >= ncount ! Loop                        ! Try the next record
    utelapsed = utnow - utstart                     ! Elapsed time 
    stETA = fmt.FormatValue(utelapsed,'@T4Z ')      ! Elapsed time in HH:mm:ss format
    UD.Debug('ImportA3 - ' & j & ' records imported out of ' & ncount & ', and ' & ncount-j & ' records dropped. Elapsed time: ' & stETA)
    CLOSE(Progress)
    MESSAGE(j & ' records imported out of ' & ncount & '<13><10>' & ncount-j & ' records dropped.<13><10>Elapsed time: ' & stETA)   ! Report the results
end ! A3.csv                                        ! All done
stline.Free()
stfile.Free()

So hopefully I have answered all the questions:
I did not try to get exclusive access, either in the TPS or ODBC version of the code.
I haven’t tried any transaction stuff.
I did not try to stop the program and restart it where I left off.

I am busy importing 241,486 records. The first 1,000 records averaged 0.012 seconds/record
By the time we got to 10,000 records it was 0.074 sec/record
at 70,000 records we are at 0.4965 sec/record
and by my calculations it will finish early on Wednesday morning. (I am writing this on Saturday evening) So far it has run for just over 12 hours. Do I just let it run or risk more delays in just getting the data inserted so I can carry on with the rest of the project?

I will try any suggestions if I have to run it again for any reason.

Logout for TPS speeds up writing a LOT. maybe 2 orders of magnitude. I’ve had a 30sec process reduce to sub second.
SQL Not so much. You’re sort of bound by the engine you’re using. Sometimes it’s Way quicker to import the raw data into a temp table and run SQL on it there.

Please can you elaborate? Where would I put it in my code?

From the cryptic “help” I know that LOGOUT and COMMIT work as a pair, but I don’t see how they work with TRYINSERT, only ADD. Is there an example I can look at? The buggy Clarion-generated Convert program doesn’t use COMMIT at all.

Also, if the TRYINSERT for one of the tables fails, does the entire transaction of all 5 TRYINSERTs fail?

after you have opened your files.

as Sean says, on tps it will greatly speed up your process…

logout(1, a3_prod, a3_advertiser, a3_3level, a3_2level, a3_1level)
if errorcode() then message(‘logout failed…’).

I tend to put the logout in a loop…

inTrans = false
loop 5 times
  logout(1,....)
  if ~errorcode()
    inTrans = true
    break
  end
end ! loop
if ~inTrans then message(<error handling>).

loop
  <process records>
end

if inTrans then commit.

as you are already doing some stuff every 1000 records you could commit and re-logout then or perhaps every 10,000 recs. Also do your DISPLAY(?Count) at that time (every 1000 or 10,000 recs) not on each record.

tryinsert will eventually call add.

but rather than doing that, another thing to try is to forget tryinsert and use Append instead. That adds the record without doing the key entries. Then when you are done you do a Build to create the keys.

also with tps, to clear your files at the start rather than

a3_prod{PROP:SQL}=‘DELETE FROM a3_prod’

you would use create() or perhaps empty()

cheers and hth

I see above that the driver is ODBC, what is the backend?

Donn mentioned he has done two versions:

TPS
Postgres

As Geoff (@vitesse) says after open. It starts a transaction where you commit or rollback.
Many people do 1000+ per commit maybe up to 10k?
ie

Loop
insert()
If count % 1000 = 0
Commit
logout()
end
end

Clarion locks the TPS files and seems to do the processing in memory so it’s way fast.
I did benchmarks a decade or more back on various backends vs TPS and for Inserts TPS was a leader. (updates and deletes are different :slight_smile: )

Okay, putting TPS aside, I would write a PG insert function and pass the CSV record values to it. PG considers all inserts done within a function as a single transaction and can return a value indicating success or not.

OK, if I transfer all the data from a CSV file into a TPS file, that’s all well and good, but how do I get the data from the TPS files into the PostgreSQL database? pgAdmin III only allows importing to tables from Text, CSV, or Binary.

FWIW I have tried using a COMMIT every 1000 records on the ODBC link, and it is still slowing down. It’s too early to say if it is slowing down less than before.

Update: I changed back to TPS files and the import took 5 minutes instead of 5 days. Now I just have to figure out how to get the data from a TPS file to PostgreSQL.

Use PGAdmin to write your PG insert function. The function should parse a CSV record string and do the insert to each appropriate PG table. The PG function will post an exception if any insert fails, thereby retaining relational integrity. I would consider this an opportunity to learn about calling PG functions from Clarion using a generic table.

Thanks for the advice, but I haven’t a clue what you mean by a “generic table” or how to write or call a PG function. And of course the Clarion “help” file is just full of useful examples.

Capture

OK I’ve re-read the thread again. I think your fastest route is to copy the info into a new temp SQL table.
That should be quick and easy. Just a Loop; Next();Insert();cycle
From there I’d write an SQL script to perform your processing. SQL is beautiful for set operations. ie doing the same thing to a heap of records.
You’ll find running 6 update commands to do bits on all the records quicker than clarion doing all bits for each record. Hope that makes sense.

A generic table in clarion is a table with a lot of strings and a /turbosql driver string. /turbosql just means this is not a real table don’t verify the fields.

It can be used for ad hoc Selects and such.
To call a function in postgres it’s table{prop:SQL}=‘SELECT TheFunction()’

To expand, assume you copy your A3prod to A3prodTmp on postgres
Then issue a heap of insrts like:

INSERT INTO a3_1Level (section_code_original,section_name,finished_task)
(Select section_code_original,section_name,’’ FROM A3ProdTmp);

etc for L2 L3 and Ad then similar for A3 Prod

1 Like

If your conversion is a one off, I certainly understand doing just about anything to get the job done.
If recurring, IMO, getting the relational logic out of Clarion is a good thing.

I was able to get this going after a few suggestions from others. I am 100% confident with your experience it will be even faster. It isn’t in the help.

Here is a rough example of a Clarion function and table definition that allows calling a PG function and returning the result. Once this is compiled in a your own application library (Clarion dll), it will be always be an easy call.

PgInt FILE,DRIVER(‘ODBC’,’/TURBOSQL=TRUE’),PRE(PgI),BINDABLE,THREAD
Record RECORD
IValue LONG
END
END

PgGetResult PROCEDURE(STRING pSql)
CODE
OPEN(PgInt)
CLEAR(PgI:RECORD)
PgInt{PROP:Sql} = pSql
IF FILEERRORCODE()
MESSAGE('FILEERROR(),‘PostgreSQL Function Error’)
RETURN(-1)
END
NEXT(PgInt)
CLOSE(PgInt)
RETURN(PgInt.IValue)

On the PG side, learning to write basic PG functions is easy with PgAdmin. Lots of web examples. Likewise, the use of NEW, OLD, TG_OP, and plpgsql was straight forward.

HTH, Douglas

2 Likes

I just discovered the main source of the slowdown: when I created the primary key indexes I accidentally enabled the AUTONUMBER, which caused every TryInsert to work, and I ended up with 5 tables each with 241,486 records! Even worse, none of the PK values were of any use whatsoever. Serious facepalm moment.

Keep in mind, for PG tables, autonumber in the DCT needs to be off.

1 Like

Thanks for the tip. :+1: Fortunately FM3 didn’t try to do an autonumber on the PG side, but I was still feeding PG garbage from the Clarion side. I didn’t notice it until I started looking at the actual data, because the PG data structure and indexes/constraints were exactly what I thought they should be.

The other thing I realized while adding the records to TPS files, is that I could create CSV files that exactly match my file structure, and as I add a new record to, say a3_2level, I could write a line with the same data into a3_2level.csv. Then I can use pgAdmin to import the CSV files quickly and easily into PG after I have converted the TPS structure to the ODBC equivalent.

All good so far. I will keep the PG function examples because I’m pretty sure that the next part of my project will require them.