Author Topic: Looking for Software with this feature (Read 23903 times)

Shades · « **Reply #25 on:** July 05, 2015, 10:17 AM »

I'm wondering why file size matters, because wouldn't the date be off by even a few seconds if it's two different copies of a file? Even with a high-speed automated "log.txt" or something that is updated and then aggressively backed up, do any of the options above change if the program doesn't need to know the file size (or maybe the checksum, because, for example, if someone opens a text file and MS Word adds a line break, it's now different)?
_______________________
-TaoPhoenix (July 03, 2015, 02:39 PM)

The OP refers to file "name, extension and size", but file size is generally an unreliable/imprecise basis for file comparison, whereas content (checksum) is pretty definitive as a data analysis tool.
You seem to have conflated "time" with "size", and yes, "time" is also an imprecise basis for file comparison - mainly because of the different and inconsistent time-stamping standards applied by different file operations and file management tools.
-IainB (July 03, 2015, 09:21 PM)

IainB is right about having a checksum from both files and comparing these to find out if the files are the same or not.

Unfortunately, it looks like xplorer2 uses CRC to generate these checksum values. The advantage of CRC checksums is that they are fast to generate. The disadvantage is that they are not always unique: two different files can end up with the same CRC value.

So these were replaced with MD5 hash values (which take a bit more time to generate), but nowadays these can also be tricked, since different files with the same MD5 value can be constructed on purpose. The best option for now is to generate SHA-based hash values of files to identify them. But again, these take even longer to generate.

The method IainB suggests is the best method you can apply to identify if files are unique or not. CRC is better than nothing for this purpose, but not much. SHA is much better, but consumes a lot of computational resources, so if your system doesn't have much of that (readily) available...expect to wait long times.
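As a rough illustration of the three kinds of values mentioned above, here is a minimal Python sketch that computes a CRC-32, an MD5 and a SHA-256 value for one file in a single pass. The function name `file_digests` and the 1 MB chunk size are assumptions made for this example; it says nothing about how xplorer2 or any other tool does it internally.

```python
import hashlib
import zlib


def file_digests(path, chunk_size=1024 * 1024):
    """Return (crc32, md5, sha256) hex strings for a file, reading it once in chunks."""
    crc = 0
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # running CRC-32 over all chunks so far
            md5.update(chunk)
            sha256.update(chunk)
    return format(crc, "08x"), md5.hexdigest(), sha256.hexdigest()
```

Reading in chunks keeps memory use flat even for very large files. Two files with different values are certainly different; equal values only make identical content very likely.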

MilesAhead · « **Reply #26 on:** July 05, 2015, 10:56 AM »

Especially if files are large, say a > 1GB video as an example, doing a hash of A then doing a hash of B to see if A = B could get very slow. A better method may be to process them side by side:

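;;pseudocode

hashA = hashB = initial_hash;
do
{
   ;; read the next chunk of each file before hashing it
   fillbuffer(fileA);
   fillbuffer(fileB);
   hashA = hash(fileAbuffer, hashA);
   hashB = hash(fileBbuffer, hashB);
   ;; stop at the first chunk where the running hashes differ
   if (hashA != hashB)
      return false;
} until (hash_done(fileA) || hash_done(fileB));
;; equal only if both files ended at the same point
return hash_done(fileA) && hash_done(fileB);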

In other words, cut off the comparison as soon as there is a difference.
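A runnable Python sketch of the same early-exit idea follows. Note one deliberate swap: since both files are being read side by side anyway, this version compares the chunks directly instead of hashing them, which gives the same early cut-off with less work. The function name `files_identical` and the chunk size are assumptions for this example.

```python
import os


def files_identical(path_a, path_b, chunk_size=1024 * 1024):
    """Compare two files chunk by chunk, stopping at the first difference."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # files of different length can never be identical
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # cut off as soon as there is a difference
            if not chunk_a:
                return True  # both files ended without any difference
```

The standard library's `filecmp.cmp(path_a, path_b, shallow=False)` does much the same job, so in practice that may be all that is needed.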

ednja · « **Reply #27 on:** July 06, 2015, 03:37 PM »

""Size" could be a potentially unreliable comparison, so I would recommend using "Content" instead."

Yes I was aware of this and was wondering if anyone would notice it and mention it.

It would be better to check the content and there are duplicate removal programs that do compare the content with checksums. I only mentioned size because that would at least be better than my manually moving the files. After posting the question, I was thinking I should have included the content. When I'm moving files, most of them are of same size and I overwrite the files, but I often am moving too fast and when it gets to two files of different size, I overshoot and overwrite (so I lose one). Also, sometimes I've manually compared the content in two files of equal size and found that one file is corrupted and won't open, even though it has the same size. So for sure, it would be better to compare the content. In fact, as far as I'm concerned, this would make the program valuable. In the programs for duplicates removal, the program does a scan and once done it lists all the duplicates. Then to remove them you have to manually select which ones to remove (or you can allow the program to select, but I haven't found one yet that can be trusted for this function). If there aren't many duplicates in the hard drive, then manually selecting the ones to remove isn't a big deal. But when having to remove hundreds or even thousands of duplicates, this can take many hours of work. An example of having such large numbers of duplicates is when people use hard drive backup programs and the programs keep copying the same files over and over into different folders and mixing them up until the hard drive is full and the user doesn't know what to do. I don't use backup software, but I know of people that do and I've been asked to sort it out. I've done this in my brother's computer and it took me many hours.

I would modify the filters as follows:
Filter #1: If the file being moved has the same name, extension, size and content as a file already existing in the target folder, then overwrite the existing file.

Filter #2: If the file being moved has the same name and extension as a file already existing in the target folder, but the two files are different in size or have different content or both, then move the file, but keep both files.
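The two filters could be sketched in Python roughly as below. This is only an illustration under stated assumptions: the function names `unique_name` and `move_with_filters` and the " (2)" renaming pattern are made up for the example. For Filter #1 the sketch deletes the source instead of overwriting the target, because the end state is the same when both files are identical.

```python
import filecmp
import os
import shutil


def unique_name(folder, filename):
    """Return a path in folder that is not taken yet, e.g. 'song (2).gp5'."""
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    n = 2
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem} ({n}){ext}")
        n += 1
    return candidate


def move_with_filters(src, target_dir):
    """Move src into target_dir following Filter #1 and Filter #2."""
    name = os.path.basename(src)
    target = os.path.join(target_dir, name)
    if not os.path.exists(target):
        return shutil.move(src, target)  # no name clash: plain move
    # shallow=False compares sizes first, then the contents byte by byte
    if filecmp.cmp(src, target, shallow=False):
        os.remove(src)  # Filter #1: identical, so one copy is enough
        return target
    # Filter #2: same name but different size or content, keep both
    return shutil.move(src, unique_name(target_dir, name))
```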

ednja · « **Reply #28 on:** July 08, 2015, 05:02 AM »

I found this software called NiceCopier today. It's the closest so far to what I'm looking for, but still doesn't have enough automation. Plus, it doesn't compare the contents of the files. It's the only one so far that compares the size.
https://sourceforge....projects/nicecopier/

IainB · « **Reply #29 on:** July 08, 2015, 10:22 AM »

@ednja: This is your description of your revised filters, with the Opening Post filters inserted below each in the quotes, for comparison. I have highlighted the difference in the newer Filter description:

... I would modify the filters as follows:
Filter #1: If the file being moved has the same name, extension, size and content as a file already existing in the target folder, then overwrite the existing file.
Whereas the Opening Post says:
Filter #1: If the file being moved has the same name, extension and size as a file already existing in the target folder, then overwrite the existing file.
______________________________

Filter #2: If the file being moved has the same name and extension as a file already existing in the target folder, but the two files are different in size or have different content or both, then move the file, but keep both files.
Whereas the Opening Post says:
Filter #2: If the file being moved has the same name and extension as a file already existing in the target folder, but the two files are different in size, then move the file, but keep both files.
______________________________
-ednja (July 06, 2015, 03:37 PM)

Sorry, but I think at this point I must be missing something as I do not understand:
(a) Why you need to "overwrite the existing file." with the file in Source in Filter #1. It is a superfluous/redundant step. You need only to leave the Source file as-is (or delete it if it is not wanted), and leave the target file untouched.
(b) Why you persist in including the use of file size as a basis of comparison at all, when it would seem to be irrelevant (QED). It confuses the issue unnecessarily.

tomos · « **Reply #30 on:** July 08, 2015, 10:25 AM »

If there aren't many duplicates in the hard drive, then manually selecting the ones to remove isn't a big deal. But when having to remove hundreds or even thousands of duplicates, this can take many hours of work. An example of having such large numbers of duplicates is when people use hard drive backup programs and the programs keep copying the same files over and over into different folders and mixing them up until the hard drive is full and the user doesn't know what to do.
-ednja (July 06, 2015, 03:37 PM)

Yeah, in my experience, duplicate file finders still need a lot of work after the duplicates have been found. But maybe there are really good duplicate file finders out there (I haven't tried any in a few years).

[Could they/you just dump everything on an external drive and start from scratch?]
Relatives of mine want me to sort out their photos - they used big SD cards and didn't delete photos from their cameras, so they kept uploading the same photos again and again - often using different methods (so: different names as well as timestamps). That is a nightmare, gigabytes of duplicated material - and I have to admit, a nightmare I've been avoiding... so, if a solution is found here, I could give it a go as well ;-)

If you had a programme that would locate moved files - and do a bit-for-bit comparison of files with the same size (but possibly different date &/or name) - you could just keep merging folders (or copying and deleting copied folders). At some stage you'd have to run a duplicate file finder on the remaining files. Syncovery could do the merging bit intelligently (finding moved files) - but not directly - you might have to sync both ways - and then delete one 'side'.
I might try that at the weekend with those photos I was talking about.

Syncovery uses MD5 checksums - but the option appears to do this on all the files - which would take a wet week to run.
The ideal, IMO, would be to compare file sizes, and only do bit comparisons of files that are the same size (regardless of the filename/date).
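That size-first idea can be sketched in Python as follows. The function name `find_duplicates` is an assumption for this example; the byte-for-byte check is delegated to the standard library's `filecmp.cmp` with `shallow=False`. Files whose size is unique are never opened at all, which is where the time saving comes from.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    by_size = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for paths in by_size.values():
        groups = []  # each group holds files with identical content
        for path in paths:
            for group in groups:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])  # no match: start a new group
        duplicate_sets.extend(g for g in groups if len(g) > 1)
    return duplicate_sets
```

Names and dates play no part here, so the same photo uploaded twice under different names would still land in one set.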

ednja · « **Reply #31 on:** July 08, 2015, 02:13 PM »

IainB, you aren't the first one on this thread to not understand why file size is an issue for me. I will explain my reason. I have several thousand Guitar Pro tab files that I've downloaded and some I've created myself. They are tiny files, less than 100 KB each. The total size of all the files is about 5 or 6 GB. Many of them are duplicated in different folders. I've already spent a few weeks and many, many hours on organizing them and trying different programs. The reason why I move and replace if the file is the same size is that in my opinion it is highly unlikely that two files with the same name and size have different content, although it's possible. I want to organize them accurately, but I don't have the time, so I'm willing to take chances on losing some of them in the event that two or more files of the same size and name have different content. Suppose I created an ordinary text file at one point and gave the file a name, then a few years later created a text file and just happened to give it the exact same name, creating a duplicate but in a different folder. The two files would most likely have different sizes, but not necessarily, and their content is obviously very different. In this case, I wouldn't want to just overwrite one of the files with the other because the odds of the two having different content are too high for my liking. With the Guitar Pro files, a file name includes the name of the band and the name of the song. In my opinion, two Guitar Pro tab files with the same size and same name will have the same content. It's possible that someone might have edited the file slightly so that it still has the same size. There might be an important difference in the content, but I'm willing to take the chance because of time. If I were to remove all of the duplicates, the total size of all the files might be reduced to about 1 GB.
One option I have with Windows is I could just move all of the files and keep all files with the same name by automatically renaming them. This might not be so bad an idea. The total size wouldn't get reduced to about 1 GB and would remain the same size (5 or 6 GB) and nothing would be lost. But still, it would be nice to have a program that would give some options that I want and the ability to automate it for a hands-free operation. For me, if it took a week to do the operation automatically without me even being there, that's not a problem because finally I'd be free to work on some of the many other things I have to work on. I hope you'll be able to understand my need for comparing file size now. If not, then maybe it's me who's missing something?

MilesAhead · « **Reply #32 on:** July 08, 2015, 02:45 PM »

I haven't tried FreeFileSync but it claims to let you create your own rules. Also it says it can sync depending on content, date or size. So, just reading the features (which is always dangerous), it would seem creating your own rule would just entail choosing name, size and content being the same.

http://www.thewindow...rs-with-freefilesync

Edit: Note that I just downloaded and ran a scan with MBAM. It shows the installer has Open Candy.

tomos · « **Reply #33 on:** July 08, 2015, 03:27 PM »

I haven't tried FreeFileSync but it claims to let you create your own rules. Also it says it can sync depending on content, date or size. So, just reading the features (which is always dangerous), it would seem creating your own rule would just entail choosing name, size and content being the same.

http://www.thewindow...rs-with-freefilesync

Edit: Note that I just downloaded and ran a scan with MBAM. It shows the installer has Open Candy.
-MilesAhead (July 08, 2015, 02:45 PM)

it's okay as long as you carefully check options. See also IainB's review here on dc:
https://www.donation...ex.php?topic=29999.0

MilesAhead · « **Reply #34 on:** July 08, 2015, 03:53 PM »

^^^ thanks. I just blew up my system yesterday so I don't want to restore an image again today. Paranoia mode.

TaoPhoenix · « **Reply #35 on:** July 08, 2015, 04:52 PM »

IainB, you aren't the first one on this thread to not understand why file size is an issue for me. I will explain my reason. I have several thousand Guitar Pro tab files that I've downloaded and some I've created myself. They are tiny files, less than 100 KB each. The total size of all the files is about 5 or 6 GB. Many of them are duplicated in different folders. I've already spent a few weeks and many, many hours on organizing them and trying different programs. The reason why I move and replace if the file is the same size is that in my opinion it is highly unlikely that two files with the same name and size have different content, although it's possible. I want to organize them accurately, but I don't have the time, so I'm willing to take chances on losing some of them in the event that two or more files of the same size and name have different content. Suppose I created an ordinary text file at one point and gave the file a name, then a few years later created a text file and just happened to give it the exact same name, creating a duplicate but in a different folder. The two files would most likely have different sizes, but not necessarily, and their content is obviously very different. In this case, I wouldn't want to just overwrite one of the files with the other because the odds of the two having different content are too high for my liking. With the Guitar Pro files, a file name includes the name of the band and the name of the song. In my opinion, two Guitar Pro tab files with the same size and same name will have the same content. It's possible that someone might have edited the file slightly so that it still has the same size. There might be an important difference in the content, but I'm willing to take the chance because of time. If I were to remove all of the duplicates, the total size of all the files might be reduced to about 1 GB.
One option I have with Windows is I could just move all of the files and keep all files with the same name by automatically renaming them. This might not be so bad an idea. The total size wouldn't get reduced to about 1 GB and would remain the same size (5 or 6 GB) and nothing would be lost. But still, it would be nice to have a program that would give some options that I want and the ability to automate it for a hands-free operation. For me, if it took a week to do the operation automatically without me even being there, that's not a problem because finally I'd be free to work on some of the many other things I have to work on. I hope you'll be able to understand my need for comparing file size now. If not, then maybe it's me who's missing something?

-ednja (July 08, 2015, 02:13 PM)

Ednja, I'm starting to think sideways (and I'll let my betters fill in details!). But it's sounding like it's "just your music files" and not some dangerous regulatory compliance environment etc.

So maybe you're in the mood/market for something "ugly and fast"? Aka with "small" risks to your data, you value your own time and just want "most of an answer"?

Just as a thought experiment, try daydreaming about this for a few minutes:

1. What are the "limiting factors"? What if disk drive space isn't one? You could afford to have a process with a couple of "clunky" middle parts that isn't aesthetic at all, but if it's fast and the final output is pretty good, why not?

2. (Paraphrased) "Occasionally I'll have files with the same name but dramatically different content". This is more common than your post might suggest - people can call 8 files "test1"! (Or whatever).

3. So you make something like a 4 stage folder system.
Folder 0 is your original that you have to be kinda careful on.
Folder 1 is your first "processing folder". Your first step is "copy and force auto rename on any name duplicates". So two copies of "test1" become "test1" and "test1 ver2" or something. So folder 1 is guaranteed to have no name match hits.

4. Run (prob a diff program?) on Folder 1 to force the suspicious files to the end. A trick I used (manually) is you stick a Z in front of the file name of each "threatened duplicate", then re-sort the files alphabetically (often just four clicks in a window). So then all the files without Z's are "clean files" and you just copy them over into Folder 2 en masse and they should be fine. Then you copy over the "Z files" knowing you messed up the names on them, and one day when you have the energy you can listen to them and give them better legit names, and then they go into Folder 2 with "legit names" rather than "Idea1" etc.

5. Your choice of other things to do with Folder 2, and whatever makes it into Folder 3 is your curated clean file list!
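The "copy and force auto rename" step from point 3 could look something like this in Python. It is only a sketch: the function name `copy_flat_with_rename` is made up, the " ver2" suffix follows the naming used in the example above, and the working folder is assumed to sit outside the original folder tree. The originals in Folder 0 are only read, never changed.

```python
import os
import shutil


def copy_flat_with_rename(source_root, work_dir):
    """Copy every file under source_root into work_dir, renaming on name clashes."""
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for folder, _dirs, files in os.walk(source_root):
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            dest = os.path.join(work_dir, name)
            version = 2
            while os.path.exists(dest):  # name already used: try "name ver2", "name ver3", ...
                dest = os.path.join(work_dir, f"{stem} ver{version}{ext}")
                version += 1
            shutil.copy2(os.path.join(folder, name), dest)  # copy2 keeps timestamps
            copied.append(dest)
    return copied
```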

ednja · « **Reply #36 on:** July 09, 2015, 12:10 AM »

Well, if it were the guitar pro tabs that I created myself, I would never take such chances on them, just to let you know that I understand the risk of such software.

When I signed up for this forum a couple days ago, I didn’t realize that the members in the discussions are programmers. I’m not a programmer. I’ve only done some machine language programming for the Z80 when I was in college.

Because the issue has been brought up though, about the risk involved in such software, and I’ve had some time to think about it, I’m thinking now that maybe such a software wouldn’t be a good idea. That leads me to a different requirement then, thanks to all the input.

I would require the software to:

1. Have an automated Move function in which I can select options to make it compare the contents of the files to be moved with the contents of all the files in the target folder. Then all files with unique contents would be saved, and if more than one file has the same content, then only one would be saved. Renaming would be done in any event where two or more files to be saved have the same name.
2. The software would also have a Remove Duplicates function where it can scan any selection of files and folders for files with the same contents, renaming files when necessary and removing duplicates.
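To make requirement 1 concrete, here is a minimal Python sketch of a content-aware move. Everything named here (`sha256_of`, `move_unique`, the " (2)" rename pattern) is an assumption for illustration, not a description of any existing program. It treats equal SHA-256 values as equal content; a more cautious version would confirm with a byte-for-byte comparison before deleting anything.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def move_unique(source_dir, target_dir):
    """Move files with new content into target_dir; drop content duplicates."""
    seen = {sha256_of(os.path.join(target_dir, n))
            for n in os.listdir(target_dir)
            if os.path.isfile(os.path.join(target_dir, n))}
    moved = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue
        digest = sha256_of(src)
        if digest in seen:
            os.remove(src)  # same content is already kept in the target
            continue
        stem, ext = os.path.splitext(name)
        dest, n = os.path.join(target_dir, name), 2
        while os.path.exists(dest):  # same name, different content: rename
            dest = os.path.join(target_dir, f"{stem} ({n}){ext}")
            n += 1
        shutil.move(src, dest)
        seen.add(digest)
        moved.append(dest)
    return moved
```

Only the top level of each folder is handled, to keep the sketch short.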

I don’t see this software coming into existence in the near future, so in the meantime, I’m going to just use the Windows 7 function of automatically keeping all of the files, regardless of size, and renaming all of the same name duplicates.

IainB · « **Reply #37 on:** July 09, 2015, 03:35 PM »

@ednja: Thank you for explaining why you consider that file size is an issue for you, and for describing the nature of your population of files (Guitar Pro tab files). I downloaded one to examine the contents, which look to be a mixture of ASCII and binary data.

From what you say, I think you may misunderstand why size is largely irrelevant and best not used for comparing files.
Take a hypothetical example, where you have 2 files composed of just ASCII text:
File #1 contains only the character string "ABCDE"
File #2 contains only the character string "VWXYZ"

Since each ASCII character is stored as a fixed 8 bits (1 byte), the total file size in each case would be the same - i.e., 40 bits, or 5 bytes.
So the file sizes will be the same, though the data contained is not the same.

However, a checksum (which is a hashing calculation) of the binary values of each file will be very different.
File size is simply a measure of how many bytes there are in a file and how much space the file occupies on disk.
Equal file sizes show only that the files could be identical; they do not otherwise tell you anything useful about the actual contents of the files - e.g., whether there is a high probability that the contents of those files are identical.

A file checksum is a number that represents a nearly unique mathematical value of the hashed sum of the contents of the file. In xplorer² it is a good guide as to uniqueness as it shows this numeric "summary" of a file's contents.

If the checksums of two files are different, then the files are definitely different, even though they may have the same file size.
However, if the checksums are equal, then this would imply a high probability that the files are identical (though this is not absolutely certain). If the file sizes were also equal, then this could possibly be used to augment the user's confidence level that the files were identical, since files of different sizes can never be identical, but it would carry no statistical certainty that this was in fact the case.
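The "ABCDE" / "VWXYZ" example can be checked in a few lines of Python. This is just a small demonstration using CRC-32 and SHA-256 from the standard library; the variable names are made up, and the byte strings stand in for the contents of File #1 and File #2.

```python
import hashlib
import zlib

file_1 = b"ABCDE"  # contents of File #1
file_2 = b"VWXYZ"  # contents of File #2

# Both "files" are 5 bytes long, so a size comparison sees no difference.
same_size = len(file_1) == len(file_2)

# The checksums are computed from the actual bytes, so they differ.
same_checksum = zlib.crc32(file_1) == zlib.crc32(file_2)
same_sha256 = hashlib.sha256(file_1).digest() == hashlib.sha256(file_2).digest()

print("same size:    ", same_size)
print("same CRC-32:  ", same_checksum)
print("same SHA-256: ", same_sha256)
```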

So, going back to my points from above:

...Sorry, but I think at this point I must be missing something as I do not understand:
(a) Why you need to "overwrite the existing file." with the file in Source in Filter #1. It is a superfluous/redundant step. You need only to leave the Source file as-is (or delete it if it is not wanted), and leave the target file untouched.
(b) Why you persist in including the use of file size as a basis of comparison at all, when it would seem to be irrelevant (QED). It confuses the issue unnecessarily.
-IainB (July 08, 2015, 10:22 AM)

- and, unless I am mistaken:

Point (a) remains valid.
Point (b) remains valid.

Thus, what you would seem to have is a relatively straightforward and conventional backup problem requiring no special software (as you seem to think it might need).

Suggestions:

I had been awaiting your explaining where I was "missing" something before suggesting that you could do worse than use the excellent FreeFileSync, but @MilesAhead and @tomos have since covered that in comments above.
Also, since your disk space requirements for this data are not all that great, and if backup space is not a problem, then I would suggest that you avoid hastily taking the potentially irretrievable step of automating the deletion of your duplicate files and consider using the versioning feature of FreeFileSync, which would enable you to preserve all the data in an organised fashion and using customised backup rules (as suggested by @MilesAhead), and you could then take your time sifting and organising it (say) along the lines suggested by @TaoPhoenix, before deleting anything.
If/when you have backed everything up to backup hard drives or CDs, another really handy tool that might be of use to you could be VisualCD. You can download it here.
Since your most significant file metadata is apparently contained in the filename, then VisualCD would take relatively little time to scan and index the backup drives. (Full metadata would take a lot longer.) I use it this way all the time, and keep the VisualCD indexes on my laptop where they are very quickly searched, and VisualCD will even open the actual file on the archive device/media, if connected.
If space were an issue, then you might like to consider compressing (e.g., .ZIP) large groups of the files to see whether significant space-savings were possible, or using native Windows disk compression in NTFS.
You could consider dumping all the files into a database using a simple tool - e.g., (maybe) Excel, Access or GS-Base. For the latter, refer Database System GS-Base Is Easy as a Spreadsheet--And the Price is a Steal. That review is a bit old. I am playing with a newer version of GS-Base that I bought from BitsDuJour for USD9.95, with the idea of using it as a backup database for certain types of files. As a backup repository, all the files would be stored as objects in records. Having duplicate filenames in such a repository would not cause any problems.

Hope this helps or is of use, and always assuming that I have got things right.

ednja · « **Reply #38 on:** July 09, 2015, 04:09 PM »

IainB, thanks for your continued explanation of why size isn't a concern. I believe you. I saved copies of all the original files, so I will restart the organizing process at a later date. The files don't take up much space anyway, so they can just stay as is for now. Right now I'm very busy. It's summer time and I have a huge problem on my sundeck and have to repair it and repaint it, which involves removing everything off the deck and taking it to a rented storage unit, because my rented apartment where I live is too cluttered and there is no room to stack everything in the living room. It all has to be done before the usual rainy days begin in August. However, I will continue to look at replies here when I can. For me it wasn't a waste of time posting my question here because you and others did help me to change my view about the size. Also, I am going to check out all the software that you and others have suggested. Thank you all for your time.

I will check out the software during the evening when it's too late to make noise on the deck. One other thing is that I suffer from severe depression, anxiety and insomnia which resulted after more than 10 years of continuous trauma inflicted on me by the Family Court Mafia. So for a large portion of the day, I'm very tired, fatigued and slow thinking.

I might have a future project where I program my arduino to play my guitar while I pretend to drink beer.

IainB · « **Reply #39 on:** July 10, 2015, 03:08 AM »

Good luck with the wooden deck repairs. No fun. I prefer tannalised timber or concrete (low maintenance).

kyrathaba · « **Reply #40 on:** July 10, 2015, 01:08 PM »

Free File Sync will meet criterion #2 if you choose the "keep revisions" option.

kyrathaba · « **Reply #41 on:** July 10, 2015, 01:09 PM »

Also, it's worth mentioning that you can set up Free File Sync to back up files or directories/subdirectories to an FTP server. Not well-documented, though.

zenzai · « **Reply #42 on:** September 16, 2015, 04:32 AM »

I would require the software to:

1. Have an automated Move function in which I can select options to make it compare the contents of the files to be moved with the contents of all the files in the target folder. Then all files with unique contents would be saved, and if more than one file has the same content, then only one would be saved. Renaming would be done in any event where two or more files to be saved have the same name.
2. The software would also have a Remove Duplicates function where it can scan any selection of files and folders for files with the same contents, renaming files when necessary and removing duplicates.
-ednja (July 09, 2015, 12:10 AM)

You could try this one, trial is fully functional:

http://www.duplicate-file-detective.com/

The 64 bit version is very fast, it does a SHA1 compare of content (file names ignored) of 5 GB of files in less than 10 minutes on my system.

Start with searching for duplicates based on content only (SHA1 compare), ignoring filenames, that will find all identical files. Then choose to delete all files but one in all duplicate sets. If some files in a duplicate set have different file names you can choose the one you want and delete the rest.

Then you can do a search on identical file names only, which will find files with the same file name but different content. I imagine that there aren't many of these, in which case you can rename them manually from within the program. You can also move selected files to selected folders, and all kinds of things. See attached screenshot.
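For anyone who would rather script that second pass, here is a rough Python sketch of "same file name, different content". It is not how Duplicate File Detective works internally; the function name `same_name_different_content` is made up, and each file is read into memory in one go, which is fine for small files like tabs but not for large videos.

```python
import hashlib
import os
from collections import defaultdict


def same_name_different_content(root):
    """Map each file name to its paths when copies under root differ in content."""
    digests_by_name = defaultdict(dict)  # name -> {path: SHA-1 hex digest}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as f:
                digests_by_name[name][path] = hashlib.sha1(f.read()).hexdigest()
    # keep only names whose copies have more than one distinct digest
    return {name: sorted(paths)
            for name, paths in digests_by_name.items()
            if len(set(paths.values())) > 1}
```

The result is the short list of files that need a manual look and a rename.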
