
Looking for Software with this feature


TaoPhoenix:
IainB, you aren't the first one on this thread who hasn't understood why file size is an issue for me, so I'll explain my reasoning. I have several thousand Guitar Pro tab files, some downloaded and some created myself. They are tiny files, less than 100 KB each, totalling something like 5 or 6 GB, and many of them are duplicated in different folders. I've already spent a few weeks and many, many hours organizing them and trying different programs.

The reason I move-and-replace when the file size is the same is that, in my opinion, it is highly unlikely (though possible) that two files with the same name and size have different content. I want to organize them accurately, but I don't have the time, so I'm willing to take the chance of losing some of them in the event that two or more files with the same size and name actually have different content.

Compare that with ordinary text files: if I created a text file at one point and gave it a name, then a few years later created another text file and just happened to give it the exact same name in a different folder, the two files would most likely (but not necessarily) have different sizes, and their content would obviously be very different. In that case I wouldn't want to just overwrite one file with the other, because the odds of the two having different content are too high for my liking. With the Guitar Pro files, though, a file name includes the name of the band and the name of the song, so in my opinion two tab files with the same size and same name will have the same content. It's possible that someone edited a file slightly in a way that leaves the size unchanged, and there might be an important difference in the content, but I'm willing to take that chance to save time.

If I were to remove all of the duplicates, the total size of all the files might be reduced to about 1 GB. One option Windows gives me is to just move all of the files and keep every same-named file by automatically renaming it. That might not be so bad an idea: the total size wouldn't shrink to about 1 GB and would stay at 5 or 6 GB, but nothing would be lost. Still, it would be nice to have a program that offers the options I want, with the ability to automate it for hands-free operation. If it took a week to run unattended, that's not a problem, because I'd finally be free to work on some of the many other things I have to work on.

I hope you'll be able to understand my need for comparing file sizes now. If not, then maybe it's me who's missing something?

-ednja (July 08, 2015, 02:13 PM)
--- End quote ---

Ednja, I'm starting to think sideways (and I'll let my betters fill in the details!). But it sounds like these are "just your music files" and not some dangerous regulatory-compliance environment, etc.

So maybe you're in the mood/market for something "ugly and fast"? That is, something with "small" risks to your data, where you value your own time and just want "most of an answer"?

Just as a thought experiment, try daydreaming about this for a few minutes:

1. What are the "limiting factors"? What if disk-drive space isn't one? You could afford a process with a couple of "clunky" middle parts that isn't aesthetic at all, but if it's fast and the final output is pretty good, why not?

2. (Paraphrased) "Occasionally I'll have files with the same name but dramatically different content". This is more common than your post might suggest - people can call 8 files "test1"! (Or whatever).

3. So you make something like a 4-stage folder system (sketched in code after this list).
Folder 0 is your original set, which you have to be kinda careful with.
Folder 1 is your first "processing folder". Your first step is "copy and force an auto-rename on any name duplicates", so two copies of "test1" become "test1" and "test1 ver2" or something. Folder 1 is therefore guaranteed to have no name collisions.

4. Then run something (probably a different program?) on Folder 1 to force the suspicious files to the end. A trick I used (manually): stick a "Z" in front of the file name of any "threatened duplicates", then re-sort the folder alphabetically (often just a few clicks in a window). All the files without a "Z" are "clean files", so you copy them into Folder 2 en masse and they should be fine. Then you copy over the "Z files", knowing you messed up their names; one day, when you have the energy, you can listen to them and give them better legit names, and then they go into Folder 2 with "legit names" rather than "Idea1", etc.

5. Do whatever else you choose with Folder 2, and whatever makes it into Folder 3 is your curated, clean file list!
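
Just to make stage 3 concrete, here's a minimal sketch in Python of the "copy and force auto-rename" step. (The folder names "Folder0"/"Folder1" and the " ver2" suffix are just my assumptions for illustration, not anything your setup requires.)

--- Code: ---
import shutil
from pathlib import Path

SRC = Path("Folder0")   # originals - read only, never modified
DST = Path("Folder1")   # first "processing folder"
DST.mkdir(exist_ok=True)

for f in SRC.rglob("*"):
    if not f.is_file():
        continue
    target = DST / f.name
    n = 2
    # Force an auto-rename on any name collision,
    # e.g. two "test1.gp5" files become "test1.gp5" and "test1 ver2.gp5".
    while target.exists():
        target = DST / f"{f.stem} ver{n}{f.suffix}"
        n += 1
    shutil.copy2(f, target)  # copy, never move, so Folder 0 stays intact
--- End code ---

After this runs, Folder 1 is flattened and guaranteed to have no name collisions, and Folder 0 is untouched.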

ednja:
Well, just so you know that I understand the risks of such software: if it were the Guitar Pro tabs I created myself, I would never take such chances with them.

When I signed up for this forum a couple of days ago, I didn't realize that the members in these discussions are programmers. I'm not a programmer; I've only done some machine-language programming for the Z80 when I was in college.

Because the risk involved in such software has been brought up, though, and I've had some time to think about it, I'm now thinking that maybe such software wouldn't be a good idea. That leads me to a different requirement, thanks to all the input.

I would require the software to:

1.   Have an automated Move function with options to compare the contents of the files being moved against the contents of all the files in the target folder. All files with unique contents would be kept; where more than one file has the same content, only one would be kept. Renaming would be done whenever two or more files to be kept share the same name. (A rough sketch of this follows the list.)
2.   Have a Remove Duplicates function that can scan any selection of files and folders for files with identical contents, renaming files where necessary and removing the duplicates.
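
(For the programmers here: requirement 1 amounts to comparing files by a content hash. Below is a rough sketch in Python, where the function names and the choice of SHA-256 are illustrative assumptions only, not a finished tool.)

--- Code: ---
import hashlib
import shutil
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, read in 64 KB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def move_unique(source: Path, target: Path) -> None:
    """Move files from source into target, keeping one copy per unique
    content and renaming when names collide (requirement 1)."""
    # Index what the target already holds, by content.
    seen = {file_digest(p) for p in target.rglob("*") if p.is_file()}
    for f in [p for p in source.rglob("*") if p.is_file()]:
        digest = file_digest(f)
        if digest in seen:
            continue                  # identical content already kept
        dest, n = target / f.name, 2
        while dest.exists():          # same name, different content
            dest = target / f"{f.stem} ({n}){f.suffix}"
            n += 1
        shutil.move(str(f), str(dest))
        seen.add(digest)
--- End code ---

Requirement 2 falls out of the same idea: hash every file under the selected folders, keep the first file seen for each digest, and rename or remove the rest.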

I don't see this software coming into existence in the near future, so in the meantime I'm going to just use the Windows 7 behaviour of automatically keeping all of the files, regardless of size, and renaming all of the same-name duplicates.

IainB:
@ednja: Thank you for explaining why you consider file size to be an issue for you, and for describing the nature of your population of files (Guitar Pro tab files). I downloaded one to examine the contents, which look to be a mixture of ASCII and binary data.

From what you say, I think you may still be misunderstanding why size is largely irrelevant and best not used for comparing files.
Take a hypothetical example, where you have 2 files composed of just ASCII text:
File #1 contains only the character string "ABCDE"
File #2 contains only the character string "VWXYZ"

Since the binary value that represents a single ASCII character is fixed at 8 bits (1 byte) in length, the total file size in each case would be the same - i.e., 40 bits, or 5 bytes.
So the file sizes are the same, though the data contained is not.

However, a checksum (a hashing calculation) over the binary contents of each file will be very different.
File size is simply a measure of how many bytes there are in a file and how much space the file occupies on disk.
Equal file sizes are consistent with the files being identical, but otherwise tell you nothing useful about the actual contents - e.g., whether there is a high probability that the contents are identical.

A file checksum is a number computed by hashing the contents of a file, and it is very nearly unique to those contents. In xplorer² it is a good guide to uniqueness, as it shows this numeric "summary" of a file's contents:

* If the checksums of two files are different, then the files are definitely different, even if they have the same file size.
* If the checksums are equal, then there is a high probability that the files are identical (though this is not absolutely certain), regardless of the file sizes. If the file sizes were also equal, that might augment your confidence that the files were identical, but it would still carry no statistical certainty.
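
To make that concrete, here is the ABCDE/VWXYZ example above run through a real hash in Python (SHA-256 here, purely for illustration; xplorer² may well use a different algorithm):

--- Code: ---
import hashlib
from pathlib import Path

# Two 5-byte ASCII files: identical sizes, different contents.
Path("file1.txt").write_bytes(b"ABCDE")
Path("file2.txt").write_bytes(b"VWXYZ")

for name in ("file1.txt", "file2.txt"):
    data = Path(name).read_bytes()
    print(name, len(data), "bytes", hashlib.sha256(data).hexdigest()[:16])

# Both lines report 5 bytes, but the two digests differ completely -
# size alone would never have revealed that the contents differ.
--- End code ---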
So, going back to my points from above:
...Sorry, but I think at this point I must be missing something as I do not understand:

* (a) Why you need to "overwrite the existing file." with the file in Source in Filter #1. It is a superfluous/redundant step. You need only to leave the Source file as-is (or delete it if it is not wanted), and leave the target file untouched.
* (b) Why you persist in including the use of file size as a basis of comparison at all, when it would seem to be irrelevant (QED). It confuses the issue unnecessarily.
-IainB (July 08, 2015, 10:22 AM)
--- End quote ---

- and, unless I am mistaken:

* Point (a) remains valid.
* Point (b) remains valid.
Thus, what you seem to have is a relatively straightforward, conventional backup problem, requiring no special software (contrary to what you seem to think it might need).

Suggestions:

* I had been waiting for your explanation of where I was "missing" something before suggesting that you could do worse than use the excellent FreeFileSync, but @MilesAhead and @tomos have since covered that in the comments above.
* Also, since your disk-space requirements for this data are not all that great, and if backup space is not a problem, I would suggest that you avoid hastily taking the potentially irretrievable step of automating the deletion of your duplicate files. Consider instead using the versioning feature of FreeFileSync, which would enable you to preserve all the data in an organised fashion, using customised backup rules (as suggested by @MilesAhead); you could then take your time sifting and organising it, say along the lines suggested by @TaoPhoenix, before deleting anything.
* If/when you have backed everything up to backup hard drives or CDs, another really handy tool that might be of use to you is VisualCD.
* Since your most significant file metadata is apparently contained in the filename, VisualCD would take relatively little time to scan and index the backup drives. (Indexing full metadata would take a lot longer.) I use it this way all the time and keep the VisualCD indexes on my laptop, where they can be searched very quickly; VisualCD will even open the actual file on the archive device/media, if it is connected.
* If space were an issue, you might like to consider compressing large groups of the files (e.g., into .ZIP archives) to see whether significant space savings are possible, or using Windows' native NTFS disk compression.
* You could consider dumping all the files into a database using a simple tool - e.g., (maybe) Excel, Access or GS-Base. For the latter, refer to Database System GS-Base Is Easy as a Spreadsheet--And the Price is a Steal. That review is a bit old; I am playing with a newer version of GS-Base, which I bought from BitsDuJour for USD 9.95, with the idea of using it as a backup database for certain types of files. As a backup repository, all the files would be stored as objects in records, so having duplicate filenames in such a repository would not cause any problems.
Hope this helps or is of use, and always assuming that I have got things right.

ednja:
IainB, thanks for your continued explanation of why size isn't a concern. I believe you. I saved copies of all the original files, so I will restart the organizing process at a later date. The files don't take up much space anyway, so they can just stay as they are for now.

Right now I'm very busy. It's summer and I have a huge problem with my sundeck: I have to repair and repaint it, which involves moving everything off the deck and taking it to a rented storage unit, because the rented apartment where I live is too cluttered and there's no room to stack everything in the living room. It all has to be done before the usual rainy days begin in August.

However, I will continue to look at the replies here when I can. For me it wasn't a waste of time posting my question here, because you and others helped me change my view about file size. I am also going to check out all the software that you and others have suggested. Thank you all for your time.

I will check out the software during the evenings, when it's too late to make noise on the deck. One other thing: I suffer from severe depression, anxiety and insomnia, which resulted from more than 10 years of continuous trauma inflicted on me by the Family Court Mafia. So for a large portion of the day I'm very tired, fatigued and slow-thinking.

I might have a future project where I program my Arduino to play my guitar while I pretend to drink beer.

IainB:
Good luck with the wooden deck repairs. No fun. I prefer tanalised timber or concrete (low maintenance).
