@ednja: Thank you for explaining why you consider file size to be an issue for you, and for describing the nature of your population of files (Guitar Pro tab files). I downloaded one to examine the contents, which look to be a mixture of ASCII and binary data.
From what you say, I think you may have misunderstood why file size is largely irrelevant and is best not used for comparing files.
Take a hypothetical example where you have two files composed of just ASCII text:
File #1 contains only the character string "ABCDE"
File #2 contains only the character string "VWXYZ"
Since the binary value that represents a single ASCII character is fixed at 8 bits (1 byte) in length, the total file size in each case would be the same - i.e., 40 bits, or 5 bytes.
So the file sizes are identical, even though the data they contain is not.
However, a checksum (which is a hash calculated over the binary values of the file's contents) will be very different for each file.
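To make that concrete, here is a minimal sketch (in Python, purely for illustration - it has nothing to do with how xplorer² calculates its checksums) that creates the two hypothetical files and prints their sizes and checksums:

```python
import hashlib
import os

# Create the two hypothetical 5-byte files from the example above.
with open("file1.txt", "wb") as f:
    f.write(b"ABCDE")
with open("file2.txt", "wb") as f:
    f.write(b"VWXYZ")

for name in ("file1.txt", "file2.txt"):
    size = os.path.getsize(name)                    # both report 5 bytes
    with open(name, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # MD5 used here only as an example hash
    print(f"{name}: size={size} bytes, checksum={digest}")
```

Both files report a size of 5 bytes, yet the two checksums have nothing in common.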
File size is simply a measure of how many bytes there are in a file and how much space the file occupies on disk.
Equal file sizes are consistent with the files being identical, but would not otherwise tell you anything useful about the actual
contents of the files - e.g., whether there is a high probability that those contents are identical.
A file checksum is a number produced by hashing the contents of the file, and that number is very nearly unique to those contents. In xplorer² it is a good guide to uniqueness, as it shows this numeric "summary" of a file's contents:
- If the checksums of two files are different, then the files are definitely different, even though they may have the same file size.
- However, if the checksums are equal, then there is a high probability that the files are identical (though this is not absolutely certain), regardless of the file sizes. If the file sizes were also equal, that could add a little to your confidence that the files were identical, but it would still carry no statistical certainty that this was in fact the case.
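Those two bullet points can be written out as a simple decision rule. Again, this is just an illustrative Python sketch under my own assumptions - it is not how xplorer² or any particular duplicate finder is actually implemented:

```python
import hashlib

def checksum(path, algorithm="md5"):
    """Hash a file's contents in chunks, so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def compare(path_a, path_b):
    """Classify two files by their checksums, per the two bullet points above."""
    if checksum(path_a) != checksum(path_b):
        return "definitely different"        # different checksums: contents differ
    return "almost certainly identical"      # equal checksums: identical, barring a hash collision

# Using the two sample files created in the earlier sketch:
print(compare("file1.txt", "file2.txt"))     # -> definitely different
```

Note that only the checksums are compared; file size never needs to enter into the decision.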
So, going back to my points from above:
...Sorry, but I think at this point I must be missing something as I do not understand:
- (a) Why you need to "overwrite the existing file" with the file in Source in Filter #1. It is a superfluous/redundant step. You need only leave the Source file as-is (or delete it if it is not wanted), and leave the target file untouched.
- (b) Why you persist in including the use of file size as a basis of comparison at all, when it would seem to be irrelevant (QED). It confuses the issue unnecessarily.
-IainB
- and, unless I am mistaken:
- Point (a) remains valid.
- Point (b) remains valid.
Thus, what you would seem to have is a relatively straightforward and conventional backup problem, requiring no special software (though you seem to think it might).
Suggestions:
- I had been waiting for you to explain where I was "missing" something before suggesting that you could do worse than use the excellent FreeFileSync, but @MilesAhead and @tomos have since covered that in the comments above.
- Also, since your disk space requirements for this data are not all that great, and if backup space is not a problem, I would suggest that you avoid hastily taking the potentially irretrievable step of automating the deletion of your duplicate files. Consider instead using the versioning feature of FreeFileSync, which would enable you to preserve all the data in an organised fashion with customised backup rules (as suggested by @MilesAhead); you could then take your time sifting and organising it (say, along the lines suggested by @TaoPhoenix) before deleting anything.
- If/when you have backed everything up to backup hard drives or CDs, another really handy tool that might be of use to you is VisualCD. You can download it here.
- Since your most significant file metadata is apparently contained in the filename, VisualCD would take relatively little time to scan and index the backup drives. (Indexing full metadata would take a lot longer.) I use it this way all the time and keep the VisualCD indexes on my laptop, where they can be searched very quickly; VisualCD will even open the actual file on the archive device/media, if it is connected.
- If space were an issue, you might like to consider compressing large groups of the files (e.g., into .ZIP archives) to see whether significant space savings are possible (see the first sketch after these suggestions), or using native NTFS disk compression in Windows.
- You could consider dumping all the files into a database using a simple tool - e.g., (maybe) Excel, Access or GS-Base. For the latter, refer to the review Database System GS-Base Is Easy as a Spreadsheet--And the Price is a Steal. That review is a bit old; I am playing with a newer version of GS-Base that I bought from BitsDuJour for USD9.95, with the idea of using it as a backup database for certain types of files. As a backup repository, all the files would be stored as objects in records, so having duplicate filenames would not cause any problems (see the second sketch below).
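Regarding the compression suggestion above: before committing to anything, it is easy to estimate how compressible the tab files actually are. This is a rough Python sketch; the folder path is just a hypothetical placeholder for wherever your files live:

```python
import os
import zipfile

folder = r"C:\GuitarProTabs"   # hypothetical folder of tab files (assumed non-empty)

# Zip everything in the folder and compare total sizes before and after.
original = 0
with zipfile.ZipFile("tabs_sample.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            original += os.path.getsize(path)
            zf.write(path, arcname=os.path.relpath(path, folder))

compressed = os.path.getsize("tabs_sample.zip")
print(f"{original} bytes -> {compressed} bytes "
      f"({100 * (original - compressed) / original:.1f}% saved)")
```

If the saving turns out to be small (quite possible, since the files already contain binary data), then NTFS compression is probably not worth the bother either.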
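And to illustrate the "files stored as objects in records" idea from the last suggestion: below is a rough sketch using SQLite (simply because it is built into Python), not GS-Base, Excel or Access, which would each do the equivalent through their own interfaces. Duplicate filenames cause no conflict, because every record gets its own id:

```python
import os
import sqlite3

conn = sqlite3.connect("tab_repository.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS tabs (
        id       INTEGER PRIMARY KEY,
        filename TEXT,              -- duplicates allowed: no UNIQUE constraint
        content  BLOB
    )
""")

def store(path):
    """Store one file as an object (BLOB) in its own record."""
    with open(path, "rb") as f:
        conn.execute("INSERT INTO tabs (filename, content) VALUES (?, ?)",
                     (os.path.basename(path), f.read()))
    conn.commit()

# Storing the same filename twice simply creates two separate records.
store("file1.txt")   # sample file from the first sketch; any file path would do
store("file1.txt")
print(conn.execute("SELECT id, filename FROM tabs").fetchall())
```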
Hope this helps or is of use, and always assuming that I have got things right.