Hi sajman99 and other guys,
After several days of brainstorming, I think I've finally solved this grouping problem in a way that is (more or less) satisfactory. What I did is group all files that can be grouped (that is, sets of files that are all mutually duplicates); files that can't fit into any group are left out and presented in a non-grouped view. Here is a more detailed explanation:
Warning: some amount of mental work will be required to read this post. If you are just trying to relax by reading the DC forums, skip to the last paragraph.
Let's say the user performed a search with the similarity threshold set to 90% (meaning only files that are 90% or more similar will be considered duplicates), and let's say that 6 files are found with the following matches:
- file1 and file2 are 90% similar
- file1 and file3 are 92% similar
- file2 and file3 are 95% similar
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
And fileX and fileY are 0% similar, where X is a number {1,2,3} and Y is a letter {A,B,C}. In other words, no red file (fileA, fileB, fileC) is a duplicate of any blue file (file1, file2, file3), and vice versa.
Now, in order to group these files, they need to be at least 90% similar (as set in the threshold option). Well, for file1, file2 and file3 this is not a problem: they are all at least 90% similar to each other (thus they are all mutually duplicates), so they fit into one group. Fine, we have group#1, which I will call the blue group.
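To make the "all mutually duplicates" condition concrete, here is a minimal Python sketch. It is only an illustration, not the program's actual code: the `scores` dictionary and the `mutually_duplicate` helper are names I made up for this post, and pairs that are not listed are assumed to be 0% similar.

```python
from itertools import combinations

# Similarity scores (percent) from the example above.
# Hypothetical data layout; unlisted pairs are treated as 0% similar.
scores = {
    ("file1", "file2"): 90, ("file1", "file3"): 92, ("file2", "file3"): 95,
    ("fileA", "fileB"): 90, ("fileA", "fileC"): 90, ("fileB", "fileC"): 70,
}

def mutually_duplicate(files, scores, threshold=90):
    """Return True if every pair among `files` meets the threshold."""
    return all(
        # Look the pair up in either order, defaulting to 0%.
        scores.get((a, b), scores.get((b, a), 0)) >= threshold
        for a, b in combinations(files, 2)
    )

print(mutually_duplicate(["file1", "file2", "file3"], scores))  # True
print(mutually_duplicate(["fileA", "fileB", "fileC"], scores))  # False
```

The second call fails only because of the single fileB/fileC pair at 70%, even though both of those files match fileA.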
Now, what about the red group? Obviously not all 3 red files can fit into it, because fileB and fileC are only 70% similar (which is lower than the user-defined threshold). So I will have to leave one file out: I can group fileA and fileB (leaving fileC out) OR fileA and fileC (leaving fileB out). To the software it is arbitrary which files get grouped, so let's say it groups fileA and fileB into group#2.
Now the result is that we have group#1 and group#2 in the group view, plus fileC left out, which will be reported to the user as "the file that could not be grouped because it belongs to more than one group" and will be handled separately.
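The whole walk-through can be sketched in a few lines of Python. Take this as a rough illustration under my own assumptions rather than the real implementation: it uses a simple first-fit strategy (each file joins the first group whose members it all matches), the names `SCORES`, `is_duplicate` and `group_files` are invented for the example, and the input is assumed to contain only files that have at least one match.

```python
THRESHOLD = 90  # user-defined similarity threshold, in percent

# Pairwise scores from the example; unlisted pairs count as 0% similar.
SCORES = {
    frozenset(("file1", "file2")): 90,
    frozenset(("file1", "file3")): 92,
    frozenset(("file2", "file3")): 95,
    frozenset(("fileA", "fileB")): 90,
    frozenset(("fileA", "fileC")): 90,
    frozenset(("fileB", "fileC")): 70,
}

def is_duplicate(a, b, threshold=THRESHOLD):
    """Two files are duplicates if their similarity meets the threshold."""
    return SCORES.get(frozenset((a, b)), 0) >= threshold

def group_files(files, threshold=THRESHOLD):
    """First-fit grouping: returns (groups, files left out)."""
    groups = []
    for f in files:
        for g in groups:
            # A file may join a group only if it matches EVERY member.
            if all(is_duplicate(f, m, threshold) for m in g):
                g.append(f)
                break
        else:
            groups.append([f])  # no group fits, so start a new one
    grouped = [g for g in groups if len(g) > 1]
    left_out = [g[0] for g in groups if len(g) == 1]
    return grouped, left_out

files = ["file1", "file2", "file3", "fileA", "fileB", "fileC"]
grouped, left_out = group_files(files)
print(grouped)   # [['file1', 'file2', 'file3'], ['fileA', 'fileB']]
print(left_out)  # ['fileC']
```

Which red file ends up left out depends purely on the input order: with fileC listed before fileB, the sketch would group fileA and fileC and leave fileB out instead.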
Complicated? It probably sounds so, but I hope it will be clearer when I post some screenshots soon to illustrate these ideas (I'm working on the GUI at the moment).
Well, this is the best solution I could figure out for this issue. If someone has actually understood all this mess I wrote and has come up with a better solution, I'd love to hear it.
At the end of the day, all these inner workings really don't matter to the end user, as long as he/she can get a clear, instant idea of how to use the search results effectively once they are presented. And that is what I am trying to accomplish here: ease of use. Grouping duplicate audio files, though somewhat complicated to implement, is a very useful feature. It should group different files that all represent the same song, and this will greatly help the user figure out which dupes should be removed.