Speeding up batch deletion in Clarion (5M TPS records)

Hi everyone,
I have a simple application that processes two local TPS files (~5 million records each, 5 keys per file). I need to delete records based on a condition.
The application contains only this single procedure.
Here’s a simplified version of the loop:

Set(FIL:KeyKey)
Loop
    Next(MyFile)
    If Errorcode() then break.
    If Year(FIL:checkDate) < Year(Today())
        Delete(MyFile)
    End
End

This takes forever; it’s clearly slow.
Any tips or other best practices to improve performance?

A few thoughts.

  1. Sometimes it’s easier to append the good rows to a new file (then BUILD()) than it is to delete rows from an existing one; a sketch follows this list.
  2. You’d definitely want to batch the work with LOGOUT/COMMIT or STREAM/FLUSH, say every few thousand rows.
  3. If you don’t have to make RTL calls inside the loop, then don’t. For example, YEAR(TODAY()) can be assumed to stay the same throughout the loop, so store its value in a variable before the loop.
  4. In deletion loops, I like going backwards. It definitely helps speedwise with queues, and helps my brain with files :)
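
A minimal sketch of tip 1, assuming a second FILE declaration NewFile with the CREATE attribute, a NEW: prefix, and the same record layout (all of those names are illustrative):

  ThisYear = YEAR(TODAY())
  CREATE(NewFile)                     ! empty work file, same structure
  SET(MyFile)                         ! physical order; no key needed
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF YEAR(FIL:checkDate) >= ThisYear     ! keep only the good rows
      NEW:Record = FIL:Record              ! assumes matching RECORD layouts
      APPEND(NewFile)                      ! APPEND skips key maintenance
    END
  END
  BUILD(NewFile)                      ! rebuild all 5 keys in one pass
  ! then swap the new file in for the old one on disk
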
3 Likes

Thanks @jslarve
That refreshed my tiny brain cells

        ThisYear = year(today())
        Stream(MyFile)
        Set(FIL:KEYKEY)
        loop 
              previous(MyFile)
              if errorcode() then break.
              count +=1
              if count % 100 = 0
                Stateline = count &' records read'
                display 
              end           
              if year(FIL:checkDate) < ThisYear
                delete(MyFile)
                delcount += 1
                Statedeleted = delcount &' records deleted'
                display
              end     
        end 
        Flush(MyFile)

I will try the Append+Build way and compare results.

DISPLAY is very, very expensive time-wise. If you have millions of rows, I’d DISPLAY every thousand rows at most. Getting rid of most of those DISPLAY calls will be like night and day.

3 Likes

You are doing a DISPLAY every time you Delete so that will slow things a lot.

I’d change from SET(Key) to SET(File). My reason is that DELETE changes the index, so I would hope it’s simpler and faster not to use an index.

I would suggest code like this:

  ThisYear = year(today())
  Stream(MyFile)
  Set(MyFile)   !was Set(FIL:KEYKEY)
  loop
       NEXT(MyFile)         !Odd? previous(MyFile)
       if errorcode() then break.
       if year(FIL:checkDate) < ThisYear
         delete(MyFile)
         delcount += 1
!Not here --->  Statedeleted = delcount &' records deleted'
!Not here --->  display
       end
       count +=1
       if count % 1024 = 0 THEN   !Divide LONG by power of 2 better than % 1000 
         Statedeleted = delcount &' records deleted'
         Stateline    = count    &' records read'
         display
       end
  end
  Statedeleted = 'FLUSH '& delcount &' Deleted'
  DISPLAY        !In case Flush takes some time update Display
  Flush(MyFile)

Rather than calling the YEAR() function 5 million times consider calculating the Start of Year Date so it is a simple compare IF Long < Long

ThisYearJan1   LONG  !or DATE if that matches FIL:checkDate ?

  ThisYearJan1 = DATE(1, 1, Year(today()) )
...
       if FIL:checkDate < ThisYearJan1    !was year(FIL:checkDate) < ThisYear
          delete(MyFile)
...
3 Likes

I always use LOGOUT when doing batch runs. In my experience it makes things much faster.

3 Likes

Thank you, everybody.
I will use a mix of all these tips.
@Niels_Larsen do you use just one Logout/Commit or many cycles of them?

Just one.
This method also ensures that you have more consistent data.

LOGOUT(1, tablexxx)
LOOP
    NEXT(tablexxx)
    IF ERRORCODE() THEN BREAK.    ! end of file
    DELETE(tablexxx)              ! apply the delete condition here
    IF ERRORCODE()
        Error = TRUE
        BREAK
    END
END
IF NOT Error
  COMMIT
ELSE
  ROLLBACK
END
2 Likes

With that many records I would cycle the transaction every 5,000. You might run into problems with the transaction size otherwise.
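
Something like this, reusing the ThisYearJan1 compare from above (the 5,000 boundary is illustrative, and it’s worth verifying that the sequential position survives a COMMIT with your driver):

  LOGOUT(1, MyFile)
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate < ThisYearJan1
      DELETE(MyFile)
      delcount += 1
      IF delcount % 5000 = 0          ! batch boundary
        COMMIT                        ! close this transaction
        LOGOUT(1, MyFile)             ! and start the next one
      END
    END
  END
  COMMIT                              ! commit the final partial batch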

1 Like

As well as what everyone else has mentioned, you say the file has 5 keys. If one starts with the CheckDate then there’s no need to read all 5 million records.
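
For example, with FIL:CheckDateKey as an assumed name for a key whose first component is FIL:checkDate:

  ThisYearJan1 = DATE(1, 1, YEAR(TODAY()))
  SET(FIL:CheckDateKey)               ! start at the lowest checkDate
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate >= ThisYearJan1 THEN BREAK.  ! reached this year; stop early
    DELETE(MyFile)
  END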

I tend to use logout/commit rather than stream/flush but either should give a big speed boost.

1 Like

It’s easy to show a Progress control. It’s a nice visual indicator of the status to watch grow to 100%.

Just add a PROGRESS to the Window with its USE() set to your Count. Set the Range High to the RECORDS() in the file, then you don’t have to calculate a percent.

count           LONG 
ThisYearJan1    LIKE(FIL:checkDate)  !LONG or DATE
Window WINDOW('Purge MyFile'),AT(,,216,59)
        PROGRESS,AT(8,15,200,12),USE(count)  !<-- Add Progress
    END
...
  open(Window)
  ?count{PROP:RangeHigh} = Records(MyFile)    !Progress 100% count 
  ThisYearJan1 = DATE(1, 1, Year(today()) )   !calc 1/1/Year Date 
  Stream(MyFile)
  Set(MyFile)   !was Set(FIL:KEYKEY)
  loop
       NEXT(MyFile)
       if errorcode() then break.
       if FIL:checkDate < ThisYearJan1    !was if year(FIL:checkDate) < ThisYear
         delete(MyFile)
         delcount += 1
       end
       count +=1
       if count % 4096 = 0 then   !Divide LONG by power of 2 better than % 1000
          Statedeleted = delcount &' Records Deleted ' & |
                         INT(delcount/count*100) &'%'   !Show % Deleted
          Stateline    = count    &' Records Read'
          display
       end
  end
  Statedeleted = 'FLUSH '& delcount &' Deleted'
  DISPLAY        !In case Flush takes some time update Display
  Flush(MyFile)

I added the % Deleted after the count because it’s interesting to watch.

4 Likes
  1. If the file has any MEMO fields then see if using NOMEMO(MyFile) speeds up the read statements (a rough sketch follows this list). I assume NOMEMO can be used with DELETE, but I don’t know that I have ever done that.
  2. If you were accessing the .TPS across a network then I might mention using the IP driver, if you have it, with a custom server-side procedure.
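
A rough sketch of point 1. NOMEMO() only suppresses memo retrieval for the next read, so it belongs inside the loop; whether DELETE is safe after a NOMEMO read is the caveat noted above:

  SET(MyFile)
  LOOP
    NOMEMO(MyFile)                    ! skip MEMO/BLOB I/O on the next read
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    IF FIL:checkDate < ThisYearJan1
      DELETE(MyFile)
    END
  END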

1. No MEMO fields.
2. In this case it’s maintenance that will be done offline.

Thank you Carl for this information! I didn’t know this but it makes sense. Learn something new every day!

As this file has the THREAD attribute ON, I’m tempted to get the years into a QUEUE (2024, 2023, 2022, …) and then initiate a threaded child process for each year.
Is that safe to do?

  1. Is the “FIL:checkDate” field the first component of any of the five keys? If so, use that to read subsets of the file.
  2. What is the percentage of records to delete? If deleting a high percentage, then consider that earlier suggestion of APPENDing the records you want to keep to a new file and then issuing a BUILD when done.
  3. If this is going to be done “off line” then I’d open the file in exclusive mode, OPEN(MyFile, 2h + 10h), and no, I would not use multiple threads (see the sketch after this list).
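
A sketch of point 3 (2h = read/write, 10h = deny all, so 2h + 10h opens the file exclusively):

  OPEN(MyFile, 2h + 10h)              ! exclusive: no other users or threads
  IF ERRORCODE()
    ! the file is in use; stop here
  ELSE
    STREAM(MyFile)                    ! safe now that access is exclusive
    ! ... purge loop as shown earlier ...
    FLUSH(MyFile)
    CLOSE(MyFile)
  END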

No, that’s not going to improve speed.

STREAM will Lock the File so only One Thread can access it.

Without STREAM, I’d guess you would have a lot of contention, with the threads all updating the 5 indexes to remove records.

You would need to use LOGOUT on quite small batches (say 1,000) to avoid a complete lockout of all threads. STREAM cannot be used unless the file is open exclusively, which means no threads.
Not sure this would add anything but complexity.

In my testing, STREAM() had a beneficial effect on files opened read-only as well.
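
For example, something like this (40h = deny none; the read-only access mode is 0):

  OPEN(MyFile, 40h)                   ! read only, shared
  STREAM(MyFile)                      ! buffer the sequential reads
  SET(MyFile)
  LOOP
    NEXT(MyFile)
    IF ERRORCODE() THEN BREAK.
    ! ... tally or export the record here ...
  END
  FLUSH(MyFile)
  CLOSE(MyFile)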