

New duplicate audio files finder... I am open to features suggestion


...One of the things I am currently brainstorming is how to present the search results to the user, considering that the same song can be found in two different files which are not exactly equal. That is, their audio data is not 100% identical but rather similar to some extent (e.g. 90% or more, due to different encoding options, background noise, or recording quality; unlike regular duplicate files in DupeTrasher, which are completely equal, so grouping and presenting them to the user was a much easier task for me). So basically, let's say that songX can be found in the forms of fileA, fileB and fileC, where
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
The reason for this is the probabilistic nature of non-exact comparison (that is, I use percentages and similarity rather than "exactly equal" or "not equal" as in DupeTrasher). Now the question I have is: should I include all three files in the same duplicate group? And what if fileC is 90% similar to fileD, which is just 50% similar to fileA and fileB? I hope you can see the issue here.
-AsmDev (November 15, 2009, 04:47 PM)
--- End quote ---
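
To make the issue in the quote concrete, here is a minimal Python sketch (not taken from AsmDev's software). It assumes the naive approach of "chaining", where a file joins a group if it matches any member, and uses the percentages from the quote. The names `SIM`, `chained_groups` and `weakest_pair` are made up for illustration.

```python
from itertools import combinations

# Pairwise similarity (percent) taken from the example in the quote.
SIM = {
    frozenset(("fileA", "fileB")): 90,
    frozenset(("fileA", "fileC")): 90,
    frozenset(("fileB", "fileC")): 70,
    frozenset(("fileC", "fileD")): 90,
    frozenset(("fileA", "fileD")): 50,
    frozenset(("fileB", "fileD")): 50,
}


def chained_groups(files, sim, threshold):
    """Group by chaining: a file joins a group if it matches ANY member."""
    parent = {f: f for f in files}

    def find(f):
        # Follow parent links to the group representative (with path halving).
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(files, 2):
        if sim.get(frozenset((a, b)), 0) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


def weakest_pair(group, sim):
    """Return the least similar pair inside a group."""
    return min(combinations(group, 2), key=lambda p: sim.get(frozenset(p), 0))


groups = chained_groups(["fileA", "fileB", "fileC", "fileD"], SIM, 90)
worst = weakest_pair(groups[0], SIM)
print(groups, worst, SIM[frozenset(worst)])
```

With a 90% threshold, chaining puts all four files into a single group even though some pairs inside it are only 50% similar, which is exactly the problem being described.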

Yeah, I can see that could get a little complicated to present graphically. I was attempting to comment on possible grouping issues which occur when there are multiple matches among songs A, B, C, D, etc. Some folks might prefer to see a large group with A and B at X%, A and C at X%, C and D at X%, A and D... and that gets unwieldy rather quickly. Other mathematically challenged folks like me would rather see smaller distinct groups.

Regardless, as long as you have more than one data view (i.e. large group view, small group view, list view, etc.) I'm definitely interested in your software. :)

Hi sajman99 and other guys,

After several days of brainstorming, I think I've finally got this grouping problem solved in a way that is (more or less) satisfactory. What I did is group all files that can be grouped (that is, sets of files that are all mutual duplicates of each other); files that can't fit into any group are left out and presented in a separate non-grouped view. Here is a more detailed explanation:

Warning: some amount of mental work will be required to read this post. If you are just trying to relax by reading the DC forums, skip to the last paragraph :)

Let's say that the user performed a search with the similarity threshold set to 90% (meaning only files that are 90% or more similar will be considered duplicates), and let's say that 6 files are found with the following matches:

- file1 and file2 are 90% similar
- file1 and file3 are 92% similar
- file2 and file3 are 95% similar

- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar

And fileX and fileY are 0% similar, where X is a number {1,2,3} and Y is a letter {A,B,C}. In other words, no red file (fileA, fileB, fileC) is a duplicate of any blue file (file1, file2, file3), and vice versa.

Now, in order to be grouped, files need to be at least 90% similar (as set in the threshold option). For file1, file2 and file3 this is not a problem: they are all at least 90% similar to each other (thus they are all mutual duplicates), so they fit into one group. Fine, we have group#1, which I will call the blue group.

Now, what about the red group? Obviously not all 3 red files can fit into it, because fileB and fileC are only 70% similar (which is lower than the user-defined threshold). So I will have to leave one file out: I can group fileA and fileB (leaving fileC out) OR fileA and fileC (leaving fileB out). To the software it is arbitrary which files get grouped, so let's say it groups fileA and fileB into group#2.

The result is that we have group#1 and group#2 in the group view, plus fileC left out, which will be reported to the user as "the file that could not be grouped because it belongs to more than one group" and will be handled separately.
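
For anyone who prefers code to prose, the grouping described above can be sketched roughly like this in Python. This is only an illustration under my own assumptions, not the program's actual code: it uses a simple greedy pass in file order, and the names `MATCHES`, `similarity` and `group_mutual_duplicates` are made up for the example.

```python
# Pairwise similarity (percent) from the example; unlisted pairs count as 0%.
MATCHES = {
    frozenset(("file1", "file2")): 90,
    frozenset(("file1", "file3")): 92,
    frozenset(("file2", "file3")): 95,
    frozenset(("fileA", "fileB")): 90,
    frozenset(("fileA", "fileC")): 90,
    frozenset(("fileB", "fileC")): 70,
}


def similarity(matches, a, b):
    """Look up the similarity of two files, defaulting to 0%."""
    return matches.get(frozenset((a, b)), 0)


def group_mutual_duplicates(files, matches, threshold):
    """Greedily build groups in which EVERY pair meets the threshold.

    Returns (groups, ungrouped). Files that cannot join any group of
    two or more end up in the ungrouped list.
    """
    groups, ungrouped = [], []
    assigned = set()
    for f in files:
        if f in assigned:
            continue
        group = [f]
        for g in files:
            if g == f or g in assigned:
                continue
            # g may join only if it matches all current members.
            if all(similarity(matches, g, m) >= threshold for m in group):
                group.append(g)
        if len(group) > 1:
            groups.append(group)
            assigned.update(group)
        else:
            ungrouped.append(f)
            assigned.add(f)
    return groups, ungrouped


files = ["file1", "file2", "file3", "fileA", "fileB", "fileC"]
groups, ungrouped = group_mutual_duplicates(files, MATCHES, 90)
print(groups)     # [['file1', 'file2', 'file3'], ['fileA', 'fileB']]
print(ungrouped)  # ['fileC']
```

The result depends on the order in which files are visited: if fileC comes before fileB in the list, the sketch groups fileA with fileC and leaves fileB out instead, which is the arbitrary choice mentioned above.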

Complicated? It probably sounds so, but I hope it will become clearer when I post some screenshots soon to illustrate these ideas (I'm working on the GUI at the moment).
Well, this is the best solution I could figure out for this issue. If someone has actually understood all this mess I wrote and has come up with a better solution, I'd love to hear it :)

At the end of the day, all these inner workings really don't matter to the end user, as long as he/she gets a clear, instant idea of how to use the search results effectively once they are presented. And that is what I am trying to accomplish here: ease of use. Grouping duplicate audio files, though somewhat complicated to implement, is a very useful feature. It should group different files that all represent the same song, and this will greatly help the user figure out which dupes should be removed.

