Speeding up batch deletion in Clarion (5M TPS records)

Hi everyone,
I have a simple application that processes two local TPS files (~5 million records each, 5 keys per file). I need to delete records based on a condition.
The application contains only this single procedure.
Here’s a simplified version of the loop:

Set(FIL:KeyKey)
Loop
    Next(MyFile)
    If Errorcode() then break.
    If Year(FIL:checkDate) < Year(Today())
        Delete(MyFile)
    End
End

This takes forever; it’s clearly slow.
Any tips or other best practices to improve performance?

A few thoughts.

  1. Sometimes it’s easier to append the good rows to a new file (then BUILD()) than it is to delete rows from an existing one; a sketch follows this list.
  2. You’d definitely want to batch the work with LOGOUT/COMMIT or STREAM/FLUSH, say every few thousand rows.
  3. If you don’t have to make RTL calls inside the loop, then don’t. For example, YEAR(TODAY()) can be assumed to stay the same throughout the loop, so store its value in a variable before the loop.
  4. In deletion loops, I like going backwards. It definitely helps speedwise with queues, and helps my brain with files :)
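
A minimal sketch of tip 1, assuming a second FILE declaration NewFile with the CREATE attribute, a NEW: prefix, and the same record layout (all of those names are illustrative):

  ThisYear = YEAR(TODAY())
  CREATE(NewFile)                     ! empty work file, same structure
  SET(MyFile)                         ! physical order; no key needed
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF YEAR(FIL:checkDate) >= ThisYear     ! keep only the good rows
      NEW:Record = FIL:Record              ! assumes matching RECORD layouts
      APPEND(NewFile)                      ! APPEND skips key maintenance
    END
  END
  BUILD(NewFile)                      ! rebuild all 5 keys in one pass
  ! then swap the new file in for the old one on disk
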
3 Likes

Thanks @jslarve
That refreshed my tiny brain cells

        ThisYear = year(today())
        Stream(MyFile)
        Set(FIL:KEYKEY)
        loop 
              previous(MyFile)
              if errorcode() then break.
              count +=1
              if count % 100 = 0
                Stateline = count &' records read'
                display 
              end           
              if year(FIL:checkDate) < ThisYear
                delete(MyFile)
                delcount += 1
                Statedeleted = delcount &' records deleted'
                display
              end     
        end 
        Flush(MyFile)

I will try the Append+Build way and compare results.

DISPLAY is very, very expensive time-wise. If you have millions of rows, I’d DISPLAY every thousand rows at most. Getting rid of most of those DISPLAY calls will be like night and day.

3 Likes

You are doing a DISPLAY every time you Delete so that will slow things a lot.

I’d change from SET(Key) to SET(File). My reason is that DELETE changes the index, so I would hope it’s simpler and faster not to use an index.

I would suggest code like this:

  ThisYear = year(today())
  Stream(MyFile)
  Set(MyFile)   !was Set(FIL:KEYKEY)
  loop
       NEXT(MyFile)         !Odd? previous(MyFile)
       if errorcode() then break.
       if year(FIL:checkDate) < ThisYear
         delete(MyFile)
         delcount += 1
!Not here --->  Statedeleted = delcount &' records deleted'
!Not here --->  display
       end
       count +=1
       if count % 1024 = 0 THEN   !Divide LONG by power of 2 better than % 1000 
         Statedeleted = delcount &' records deleted'
         Stateline    = count    &' records read'
         display
       end
  end
  Statedeleted = 'FLUSH '& delcount &' Deleted'
  DISPLAY        !In case Flush takes some time update Display
  Flush(MyFile)

Rather than calling the YEAR() function 5 million times consider calculating the Start of Year Date so it is a simple compare IF Long < Long

ThisYearJan1   LONG  !or DATE if that matches FIL:checkDate ?

  ThisYearJan1 = DATE(1, 1, Year(today()) )
...
       if FIL:checkDate < ThisYearJan1    !was year(FIL:checkDate) < ThisYear
          delete(MyFile)
...
3 Likes

I always use LOGOUT when doing batch runs. In my experience it makes things much faster.

3 Likes

Thank you, everybody.
I will use a mix of all these tips.
@Niels_Larsen do you use just one Logout/Commit or many cycles of them?

Just one.
This method also ensures that you have more consistent data.

LOGOUT(1, tablexxx)
LOOP
    NEXT(tablexxx)
    IF ERRORCODE() THEN BREAK.    ! end of file
    DELETE(tablexxx)              ! apply the delete condition here
    IF ERRORCODE()
        Error = TRUE
        BREAK
    END
END
IF NOT Error
  COMMIT
ELSE
  ROLLBACK
END
2 Likes

With that many records I would cycle the transaction every 5,000. You might run into problems with the transaction size otherwise.
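
Something like this, reusing the ThisYearJan1 compare from above (the 5,000 boundary is illustrative, and it’s worth verifying that the sequential position survives a COMMIT with your driver):

  LOGOUT(1, MyFile)
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate < ThisYearJan1
      DELETE(MyFile)
      delcount += 1
      IF delcount % 5000 = 0          ! batch boundary
        COMMIT                        ! close this transaction
        LOGOUT(1, MyFile)             ! and start the next one
      END
    END
  END
  COMMIT                              ! commit the final partial batch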

1 Like

As well as what everyone else has mentioned, you say the file has 5 keys. If one starts with the CheckDate then there’s no need to read all 5 million records.
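
For example, with FIL:CheckDateKey as an assumed name for a key whose first component is FIL:checkDate:

  ThisYearJan1 = DATE(1, 1, YEAR(TODAY()))
  SET(FIL:CheckDateKey)               ! start at the lowest checkDate
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate >= ThisYearJan1 THEN BREAK.  ! reached this year; stop early
    DELETE(MyFile)
  END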

I tend to use logout/commit rather than stream/flush but either should give a big speed boost.

1 Like

It’s easy to show a Progress control. It’s a nice visual indicator of the status to watch grow to 100%.

Just add a PROGRESS to the Window with its USE() set to your Count. Set the Range High to the RECORDS() in the file, then you don’t have to calculate a percent.

count           LONG 
ThisYearJan1    LIKE(FIL:checkDate)  !LONG or DATE
Window WINDOW('Purge MyFile'),AT(,,216,59)
        PROGRESS,AT(8,15,200,12),USE(count)  !<-- Add Progress
    END
...
  open(Window)
  ?count{PROP:RangeHigh} = Records(MyFile)    !Progress 100% count 
  ThisYearJan1 = DATE(1, 1, Year(today()) )   !calc 1/1/Year Date 
  Stream(MyFile)
  Set(MyFile)   !was Set(FIL:KEYKEY)
  loop
       NEXT(MyFile)
       if errorcode() then break.
       if FIL:checkDate < ThisYearJan1    !was if year(FIL:checkDate) < ThisYear
         delete(MyFile)
         delcount += 1
       end
       count +=1
       if count % 4096 = 0 then   !Divide LONG by power of 2 better than % 1000
          Statedeleted = delcount &' Records Deleted ' & |
                         INT(delcount/count*100) &'%'   !Show % Deleted
          Stateline    = count    &' Records Read'
          display
       end
  end
  Statedeleted = 'FLUSH '& delcount &' Deleted'
  DISPLAY        !In case Flush takes some time update Display
  Flush(MyFile)

I added the % Deleted after the count because it’s interesting to watch.

4 Likes
  1. If the file has any MEMO fields then see if using NOMEMO(MyFile) speeds up the read statements (a rough sketch follows this list). I assume NOMEMO can be used with DELETE, but I don’t know that I have ever done that.
  2. If you were accessing the .TPS across a network then I might mention using the IP driver, if you have it, with a custom server-side procedure.
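
A rough sketch of point 1. NOMEMO() only suppresses memo retrieval for the next read, so it belongs inside the loop; whether DELETE is safe after a NOMEMO read is the caveat noted above:

  SET(MyFile)
  LOOP
    NOMEMO(MyFile)                    ! skip MEMO/BLOB I/O on the next read
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate < ThisYearJan1
      DELETE(MyFile)
    END
  END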

1. No MEMO fields.
2. In this case it’s maintenance that will be done offline.

Thank you Carl for this information! I didn’t know this but it makes sense. Learn something new every day!

As this file has the THREAD attribute ON, I’m tempted to get the years into a QUEUE (2024, 2023, 2022, …) and then initiate a threaded child process for each year.
Is that safe to do?

  1. Is the “FIL:checkDate” field the first component of any of the five keys? If so, use that to read subsets of the file.
  2. What is the percentage of records to delete? If deleting a high percentage, then consider that earlier suggestion of APPENDing the records you want to keep to a new file and then issuing a BUILD when done.
  3. If this is going to be done “off line” then I’d open the file in exclusive mode, OPEN(MyFile, 2h + 10h), and no, I would not use multiple threads (see the sketch after this list).
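
A sketch of point 3 (2h = read/write, 10h = deny all, so 2h + 10h opens the file exclusively):

  OPEN(MyFile, 2h + 10h)              ! exclusive: no other users or threads
  IF ERRORCODE()
    ! the file is in use; stop here
  ELSE
    STREAM(MyFile)                    ! safe now that access is exclusive
    ! ... purge loop as shown earlier ...
    FLUSH(MyFile)
    CLOSE(MyFile)
  END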

No, that’s not going to improve speed.

STREAM will Lock the File so only One Thread can access it.

Without STREAM, I’d guess you would have a lot of contention, with the threads all updating the 5 indexes to remove records.

You would need to use LOGOUT on quite small batches (say 1,000) to avoid a complete lockout of all threads. STREAM cannot be used unless the file is open exclusively, which means no threads.
Not sure this would add anything but complexity.

In my testing, STREAM() had a beneficial effect on files opened read-only as well.
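
For example, something like this (40h = deny none; the read-only access mode is 0):

  OPEN(MyFile, 40h)                   ! read only, shared
  STREAM(MyFile)                      ! buffer the sequential reads
  SET(MyFile)
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    ! ... tally or export the record here ...
  END
  FLUSH(MyFile)
  CLOSE(MyFile)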