This is a little difficult to explain, and currently I get it done using several different programs.
On a number of old backup drives, I have multiple daily backups of entire directories of program files and folders, along with the data used by them.
What I need is a way to extract all of the files of certain types and, of those files, remove all but one (the newest). Everything else can be deleted, leaving me with a folder that holds a single copy of each file, that copy being the newest version.
The drives are all full (the smallest is 500GB, the largest 1TB).
The files to be kept have specific extensions like .doc, .docx, .pdf etc. At this point, all of them are document types, though audio files may be added later. I have been doing this in steps, one extension at a time: find all .doc files, for example, and move them all out to another folder.
The second step is removing the duplicates created when I do this, as each file was backed up once a day, sometimes for months, making 60 or so copies of that file. Because the backups of each directory were done at different times, I have to deal with each drive as a whole and cannot just find the newest copy of a single directory. Even if I could, I still need to be sure I only keep the last version of each file, and this could end up being on a different drive.
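To make the two steps concrete, here is a rough sketch of what I am picturing as a single pass, assuming Python is available on the machine doing the work. The names `KEEP_EXTENSIONS`, `newest_by_name` and `copy_newest` are made up for illustration, "newest" is assumed to mean the latest modification time, and two files are assumed to be the same document if their names match (ignoring case).

```python
import os
import shutil

# Hypothetical list of extensions to keep; extend it as needed.
KEEP_EXTENSIONS = {".doc", ".docx", ".pdf"}


def newest_by_name(roots, extensions=KEEP_EXTENSIONS):
    """Map each lowercased file name to the path of its newest copy under roots."""
    newest = {}  # lowercased name -> (modification time, full path)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() not in extensions:
                    continue  # not one of the wanted document types
                full = os.path.join(dirpath, name)
                mtime = os.path.getmtime(full)
                key = name.lower()
                # Assumption: the copy with the latest mtime is the one to keep.
                if key not in newest or mtime > newest[key][0]:
                    newest[key] = (mtime, full)
    return {key: full for key, (_mtime, full) in newest.items()}


def copy_newest(roots, dest, extensions=KEEP_EXTENSIONS):
    """Copy the newest copy of every wanted file name into one flat folder."""
    os.makedirs(dest, exist_ok=True)
    for src in newest_by_name(roots, extensions).values():
        # copy2 keeps the original timestamps on the copy.
        shutil.copy2(src, os.path.join(dest, os.path.basename(src)))
```

Passing every drive (or every backup folder) in `roots` at once would handle the case where the newest version lives on a different drive. The obvious weakness is that two unrelated documents with the same name in different projects would be treated as one file, so only one of them would survive.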
In theory, for the reason they are kept, I do not need to keep the actual "path to the file", as in C:\a\b\c\d\filename. This same path, including the filename, exists once in every backup.
But having that information could be of some use one day, as it is possible that a given file was used in one project and then restarted in another. The project names and other information are in the path. But I was only asked to worry about the files themselves.
I have tried various duplicate removers, each offering some advantage, but nothing I have found can do the whole thing in one step.
To make it even harder, each path probably has 15 or more files in it, and keeping the full path name attached to every file would also be wasteful and cumbersome.
The ideal would be to end up with the latest version of every file that exists in each path kept and all others discarded.
C:\a\b\c\d would end up with 1.doc, 2.pdf, 4.txt, 5.docx (example only; most of the files are PDFs). That would preserve the path, which gives the logic of why the file was there to start with.
As it is, that same path, including all those files, exists multiple times. In most cases the files don't even change, but in some cases they do; otherwise I would sort the whole mess by "date of path", keep the newest version of the data directory and be done with it.
However, doing it that way would also omit a lot of documents that were deleted during the term of the project, and they want to keep every document that was ever in each one, even if it was deleted during the term of the project.
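For the path-preserving version, a rough sketch might look like the following. It assumes each daily backup sits in its own top-level folder (a "snapshot") and that a file's path relative to that folder is what identifies it. The names `KEEP_EXTENSIONS`, `newest_by_relative_path` and `rebuild_tree` are invented for illustration, and "newest" again means the latest modification time.

```python
import os
import shutil

# Hypothetical list of extensions to keep; extend it as needed.
KEEP_EXTENSIONS = {".doc", ".docx", ".pdf", ".txt"}


def newest_by_relative_path(snapshots, extensions=KEEP_EXTENSIONS):
    """Map each relative path to the newest copy found in any snapshot."""
    newest = {}  # lowercased relative path -> (mtime, full path, relative path)
    for snapshot in snapshots:
        for dirpath, _dirnames, filenames in os.walk(snapshot):
            for name in filenames:
                if os.path.splitext(name)[1].lower() not in extensions:
                    continue  # not one of the wanted document types
                full = os.path.join(dirpath, name)
                # Assumption: the path below the snapshot folder identifies the file.
                rel = os.path.relpath(full, snapshot)
                mtime = os.path.getmtime(full)
                key = rel.lower()
                if key not in newest or mtime > newest[key][0]:
                    newest[key] = (mtime, full, rel)
    return {rel: full for (_mtime, full, rel) in newest.values()}


def rebuild_tree(snapshots, dest, extensions=KEEP_EXTENSIONS):
    """Recreate one directory tree holding the newest copy of every document."""
    for rel, src in newest_by_relative_path(snapshots, extensions).items():
        target = os.path.join(dest, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(src, target)  # copy2 keeps the original timestamps
```

Because the result is the union of all snapshots, a document that appears in an early backup and was deleted later would still be kept, while a document that appears in many backups would be kept once, in its newest version.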
As I said, the path is something I think they will one day wish they had kept, but all I was asked to do is keep all the documents, just one big pile of them.
Thanks for any ideas. There are at least 20 more of these drives I have to reduce to the newest single copies of stored documents only. The rest all gets deleted and the drives reused. I am probably approaching this with tunnel vision and there must be an easier way.