Just wondering if anyone has ever developed an app to find duplicate files in a folder/subfolders and then provided an easy way to move or delete the duplicates.
I’m re-organising my photos folder and it is a mess of duplicates.
I have a couple of duplicate-finder apps, but they require selecting each duplicate individually before anything can be done. There are probably other apps that would let me choose a “keep” and a “delete” folder, but I thought I could probably write my own as a Christmas holiday pastime.
I’m guessing that I would start with two queues, one with the folder and filename and the other with the same data but only for any duplicates found. I’m only going to use the filename as the duplicate test (maybe add file size too?). I’m assuming there will be no duplicate names within a single folder, and I don’t care if a file is a duplicate with a different name.
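Something like this is what I have in mind for the queues (just a rough sketch - all names and sizes made up):

FilesQ               QUEUE,PRE(FQ)            ! every file found in the scan
Folder                 STRING(260)            ! folder path
Name                   STRING(260)            ! file name only
Size                   LONG                   ! in case I add size to the test
                     END

DupsQ                QUEUE,PRE(DQ)            ! same shape, duplicates only
Folder                 STRING(260)
Name                   STRING(260)
Size                   LONG
                     END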
Greetings -
Start with a queue that contains a directory level. Pass the procedure a starting directory location and default the level to 1. Use a DIRECTORY call to build the queue, listing both files and directories. Look at what that call returns to decide how to define your queue, adding a field for the directory level. If you find a directory at level one, increase the level to 2 and recurse until there are no more directories. Now you have the files and their locations for the whole nested directory tree.
Sort by file name and make choices as to how you want to proceed.
Start simple: build one level, then figure out the recursion. Be careful how deep you nest - queues can run out of memory.
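Untested, but the shape of it is something like this (FilesQ is whatever master queue you are filling - the names here are made up for the sketch; FILE:Queue and the ff_: equates come from Clarion's standard equates):

ScanFolder           PROCEDURE(STRING pFolder, LONG pLevel)   ! prototype this in your MAP
DirQ                   QUEUE(FILE:Queue),PRE(DirQ)            ! name/date/time/size/attrib
                       END
i                      LONG
  CODE
  DIRECTORY(DirQ, CLIP(pFolder) & '\*.*', ff_:NORMAL + ff_:DIRECTORY)
  LOOP i = 1 TO RECORDS(DirQ)
    GET(DirQ, i)
    IF BAND(DirQ:Attrib, ff_:DIRECTORY)
      IF DirQ:Name = '.' OR DirQ:Name = '..' THEN CYCLE.      ! skip self/parent entries
      ScanFolder(CLIP(pFolder) & '\' & CLIP(DirQ:Name), pLevel + 1)  ! recurse one level deeper
    ELSE
      FQ:Folder = pFolder                                     ! your master files queue
      FQ:Name   = DirQ:Name
      FQ:Size   = DirQ:Size
      ADD(FilesQ)
    END
  END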
Regards,
Roberto Artigas
In the queue I would include directory/filename, size, hash.
First populate the queue - recursing down the directory tree.
Then sort the queue on size.
Read through the queue and, where there are two or more adjacent entries of the same size, calculate a hash on each file. This could be CRC32 or MD5; code for both is readily available, or use Clarion’s built-in CRC. Store the value in the hash field of the queue.
Any file whose size differs from those on either side can have its queue entry deleted (probably easiest to read the queue backwards to save re-positioning the pointer).
When done, re-sort the queue on +size,+hash and re-do the deleting (of the queue entries, not the files!) but this time where the size and hash combination is unique.
What are left now are most likely duplicate files. If you want to be extra careful you might want to do a byte-by-byte comparison before deleting anything. Collisions on hashes are rare but they do happen.
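Roughly, the two passes look like this (untested and typed from memory - FileQ and its fields are invented for the sketch, and st is a StringTheory object, so swap in your own MD5/CRC code if you don’t have ST):

FileQ                QUEUE,PRE(FQ)
Path                   STRING(260)            ! directory + filename
Size                   LONG
Hash                   STRING(32)             ! md5 as hex
                     END
st                   StringTheory             ! needs INCLUDE('StringTheory.inc')
i                    LONG
curSize              LONG
hasTwin              BYTE
  CODE
  ! pass 1: drop entries whose size is unique, hash the rest
  SORT(FileQ, +FQ:Size)
  LOOP i = RECORDS(FileQ) TO 1 BY -1          ! backwards, so deletes don't shift unvisited rows
    GET(FileQ, i)
    curSize = FQ:Size
    hasTwin = FALSE
    IF i > 1
      GET(FileQ, i-1)
      IF FQ:Size = curSize THEN hasTwin = TRUE.
    END
    IF ~hasTwin AND i < RECORDS(FileQ)
      GET(FileQ, i+1)
      IF FQ:Size = curSize THEN hasTwin = TRUE.
    END
    GET(FileQ, i)                             ! re-fetch the row we're deciding on
    IF ~hasTwin
      DELETE(FileQ)                           ! unique size, can't be a duplicate
    ELSE
      st.LoadFile(CLIP(FQ:Path))
      FQ:Hash = st.Md5()
      PUT(FileQ)
    END
  END
  ! pass 2: re-sort on size+hash, then run the same backwards loop,
  ! this time comparing both FQ:Size and FQ:Hash against the neighbours
  SORT(FileQ, +FQ:Size, +FQ:Hash)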
You can then decide whether the program deletes the duplicates or if you present a screen and get the user to decide which to delete.
Make sure you have a backup before you run the program just in case! In fact maybe dump a list of dups to a log file for verification before you do any real deletes.
cheers
Geoff R
#edit1
Not sure if you have StringTheory, but that has MD5 (you need to tick the “Enable MD5” tickbox to include the C code).
so:
st.loadFile('someFile.txt')
hash = st.md5()
MD5 is 128 bit so less likely to have collisions than 32 bit CRC (but even on CRC it will be rare).
but if you want to be sure and do byte-by-byte comparison of two files:
st1.loadFile('someFile.txt')
st2.loadFile('anotherFile.txt')
if st1.equals(st2)
  stop('same')
else
  stop('different')
end
Thanks Roberto, fortunately it is a fairly flat folder structure.
I definitely want to be able to sort by filename to find the duplicates. I just need to get my head around the user interface (and data structure) to select which folder to remove the duplicates from, or maybe which folder to select as the “keeper” and delete duplicates from all other folders?
Thanks Geoff, I hadn’t thought of doing a hash on the files to check for uniqueness!
Other software has identified quite a few duplicates - about 3,700+ (blush!) so I’m not sure how efficient this is going to be in Clarion. It’s actually surprisingly fast in the other software. I’ll give it a try though. (I think I have ST, if not I definitely have some MD5 code somewhere)
I think I need a pencil and paper now to figure out how the UI will work.
Thanks Urayoan, I have Winmerge but I’m pretty sure it doesn’t do what I want - recursively find duplicates in a folder tree.
I’ve also tried Glary Utilities, which I use for other PC maintenance. It finds the duplicates OK, but doesn’t have a simple means of deleting them by folder. https://www.glarysoft.com/
Hi Geoff - It should be quite fast. However when doing deletes, don’t use remove() but instead use API delete, as Carl demonstrated on this thread:
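From memory the API route looks something like this (a sketch - double-check the prototype against Carl’s post):

  MAP                                         ! in your global MAP
    MODULE('win32')
      DeleteFileA(*CSTRING pFileName),LONG,RAW,PASCAL,NAME('DeleteFileA')
    END
  END

fn                   CSTRING(261)
  CODE
  fn = CLIP(FQ:Path)                          ! full path of the duplicate to remove
  IF DeleteFileA(fn) = 0                      ! zero means the delete failed
    ! handle it - file locked, read-only, etc.
  END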
Sacré bleu! If you have it then please start using it. And if not, then run out and get it so Santa can give you an early Christmas present. Once you start using it you will wonder why you didn’t start doing so years ago!
It will change the way you think and code. Sapir-Whorf hypothesis and all that…
(Make sure you are using latest version - currently 3.70 - as there have been lots of improvements over the years.)
<end of advertisement/plug>
One other thought re choosing which folder is the “keeper”. When you have your list of duplicate files, you could populate another queue of unique folders which contain duplicates. The user gets that list and ranks the folders in order of preference. If a file is in two or more folders, it is then retained in the highest-ranking folder and the other copies are deleted. This still leaves the issue of identical files with different names in the same directory - you might need the user to choose which one to keep in that case. See the sketch below.
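A rough sketch of that ranking pass (untested - FolderQ, its Rank field, and the FQ:Keep flag are all invented for illustration, and it assumes the duplicates queue from my earlier post carries the folder separately from the file name):

FolderQ              QUEUE,PRE(FOQ)           ! one row per folder that holds duplicates
Folder                 STRING(260)
Rank                   LONG                   ! 1 = most preferred, set by the user
                     END
i                    LONG
j                    LONG
lastInGroup          LONG
bestRank             LONG
bestRow              LONG
groupSize            LONG
groupHash            STRING(32)
  CODE
  SORT(FolderQ, +FOQ:Folder)                  ! so GET() by folder name works
  SORT(FileQ, +FQ:Size, +FQ:Hash)             ! identical files end up adjacent
  i = 1
  LOOP WHILE i <= RECORDS(FileQ)
    GET(FileQ, i)
    groupSize = FQ:Size
    groupHash = FQ:Hash
    bestRank  = 99999999
    bestRow   = i
    LOOP j = i TO RECORDS(FileQ)              ! find the best-ranked copy in this group
      GET(FileQ, j)
      IF FQ:Size <> groupSize OR FQ:Hash <> groupHash THEN BREAK.
      lastInGroup = j
      FOQ:Folder = FQ:Folder                  ! assumes a folder field on the files queue
      GET(FolderQ, FOQ:Folder)                ! look up the user-assigned rank
      IF ~ERRORCODE() AND FOQ:Rank < bestRank
        bestRank = FOQ:Rank
        bestRow  = j
      END
    END
    LOOP j = i TO lastInGroup                 ! keep the winner, flag the rest for deletion
      GET(FileQ, j)
      FQ:Keep = CHOOSE(j = bestRow, TRUE, FALSE)
      PUT(FileQ)
    END
    i = lastInGroup + 1
  END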