Author Topic: New duplicate audio files finder... I am open to features suggestion (Read 7057 times)

AsmDev · « **on:** November 15, 2009, 04:47 PM »

Hello DC Users,

For me as a developer, I feel it is always a good thing to ask users what they would like to see in an application before it is even released. So I am starting this thread, similar to the one I did for DupeTrasher earlier this year.

I am in the process of developing a new piece of software similar to my DupeTrasher duplicate file finder, but this time the new app will be specialized for audio files only. Like duplicate photo finders, this software should be able to find duplicate and similar audio files by comparing the audio data in them (basically listening to how they sound). In addition, it will use other parameters for duplicate detection such as ID3 tags, file name and binary content, but those will be of secondary importance because that file data can be wrong and I want detection to be almost independent of it. With "audio listening", it should be able to recognize dupes even if they have different names, tags, bitrates, sample rates and file types (mp3/wma/ogg/flac...). It is designed specifically to resist lossy encodings, but in some cases it should even be able to detect different performances of the same song (e.g. live/album/remix).

Of course, this is just the first half of the problem, which I have almost completed. The second and equally important half is presenting information to the user so that he can decide which files should be removed in the shortest amount of time and with the least effort.

So feel free to post your general suggestions and requests (if you have any and if you are interested in software like this). If you have some common scenarios where duplicate audio files are involved, let me know so that I can analyze them and find a solution in the form of a feature that can make your life easier. Your ideas on graphic design and window layouts are also welcome.

One of the things I am currently brainstorming is how to present the search results to the user, considering that the same song can be found in two different files which are not exactly equal. That is, their audio data is not 100% identical but rather similar to some extent (e.g. 90% or more, due to different encoding options/noise in the background/quality of the recording; unlike regular duplicate files in DupeTrasher, which are completely equal, so grouping and presenting them to the user was a much easier task for me). So basically, let's say that songX can be found in the forms of fileA, fileB and fileC, where
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
The reason for this is the probabilistic nature of non-exact comparison (that is, I use percentages and similarity rather than "exactly equal" or "not equal" as in DupeTrasher). Now the question I have is, should I include all three files in the same duplicate group? And what if fileC is 90% similar to fileD, which is just 50% similar to fileA and fileB? I hope you can see the issue here.
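
A minimal sketch of the issue, not the application's actual code: the function names are made up for illustration, and the scores are the hypothetical ones from the example above. If a file joins a group whenever it matches any one member, fileD gets chained into the same group as fileA and fileB even though it is only 50% similar to them.

```python
from itertools import combinations

# Hypothetical pairwise similarity scores (percent) from the example above.
SIMILARITY = {
    frozenset(("fileA", "fileB")): 90,
    frozenset(("fileA", "fileC")): 90,
    frozenset(("fileB", "fileC")): 70,
    frozenset(("fileC", "fileD")): 90,
    frozenset(("fileA", "fileD")): 50,
    frozenset(("fileB", "fileD")): 50,
}


def score(a, b):
    """Similarity of two files in percent; unlisted pairs count as 0."""
    return SIMILARITY.get(frozenset((a, b)), 0)


def chained_groups(files, threshold):
    """Group transitively: a file joins a group if it matches ANY member."""
    groups = []
    for f in files:
        matching = [g for g in groups if any(score(f, m) >= threshold for m in g)]
        merged = {f}.union(*matching)
        groups = [g for g in groups if g not in matching] + [merged]
    return groups


def weakest_pair(group):
    """Lowest pairwise similarity inside a group."""
    return min((score(a, b) for a, b in combinations(sorted(group), 2)), default=100)


groups = chained_groups(["fileA", "fileB", "fileC", "fileD"], threshold=90)
for g in groups:
    print(sorted(g), "weakest pair:", weakest_pair(g))
```

With a 90% threshold this yields a single group of all four files whose weakest pair is only 50% similar, which is exactly why "any match" grouping is questionable here.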

The easiest way for me is to present a list of all files that have at least one duplicate and then let the user mark for deletion the files he is interested in. So there would be no grouping of any kind, just a plain list. But I don't think this is a very coherent way of presenting the results of a duplicate search (I will probably include it as just one of the views, though).

Of course, as always, DC users will get special treatment from me. Besides discounts, I will provide free licenses for the monthly newsletter, beta testers and other contributors.

Thanks!

Curt · « **Reply #1 on:** November 16, 2009, 04:08 AM »

I download several thousand audio files each month. Because of "Best of..", "The Very Best of..." I automatically get a lot of duplicates. But these files have not yet been freed of DRM, so they are not yet placed in My Music, but in a special download folder. So one of my wishes for the program's navigator would be an option to Remember & Scan a Favorite Folder other than My Music, please.

hmm... was this understandable?

Your questions were too difficult for me to answer.

tomos · « **Reply #2 on:** November 16, 2009, 08:01 AM »

I've not used any file duplicate apps (yet; I recently bought DupeTrasher but haven't had a chance to install it) so any suggestions may vary wildly

It seems to me that it would be good to be able to easily differentiate between Exactly the Same & Similar - similar being where the percentages come in

Also, maybe an indication of what 90% similar could mean - could that be a different file (e.g. diff bitrate) but actually the same recording.
-
Another example - this may be unusual - I have a small collection but relatively a lot of live tracks - I often crop them at either end if there's too much waffling & naturally keep the original recordings. Would the app be able to tell me they are exactly the same except one is cropped (this is not really a feature request but you did ask for scenarios!)

sajman99 · « **Reply #3 on:** November 16, 2009, 04:12 PM »

Good news, AsmDev. I look forward to your new audio comparison software.

I hope it will have sufficient options so the user can implement his/her own preferences. For example, if my preference is to start comparison at a similarity/tolerance of 60%, I should be able to do that without moving down from a higher pre-set level.

Likewise, in observing matches I would like a choice as to how they are grouped--something like "show all matches" or "show most relevant matches". Regardless of the presentation of matches, the software will still have to perform the same amount of work (if I understand correctly), but the more presentation options, the better!

Also, some type of (optional) cache management feature would be convenient so users don't have to start at ground zero every time they scan a large music collection.

Good Luck with development.

AsmDev · « **Reply #4 on:** November 18, 2009, 06:48 PM »

Curt, if I understand you correctly, you would like to have an option that will add new music files to your collection only if they are not already present there? That is something I also have in mind, since many users have a sorted music collection and often need to add new songs, so an automated solution that checks whether they already have those songs (with additional info, such as whether it is of better quality than the new file) would be useful

It seems to me that it would be good to be able to easily differentiate between Exactly the Same & Similar - similar being where the percentages come in

Also, maybe an indication of what 90% similar could mean - could that be a different file (e.g. diff bitrate) but actually the same recording.
-tomos (November 16, 2009, 08:01 AM)

Ok thanks, this sounds reasonable. I will probably create several views as in DupeTrasher... some for exact dupes and similar ones with information on how they differ.

Another example - this may be unusual - I have a small collection but relatively a lot of live tracks - I often crop them at either end if there's too much waffling & naturally keep the original recordings. Would the app be able to tell me they are exactly the same except one is cropped (this is not really a feature request but you did ask for scenarios!)
-tomos (November 16, 2009, 08:01 AM)

Well, in general it will be resistant to silence at the beginning of the track, but I am not sure how it would handle stuff like this. Waffling and noise are part of the audio information and currently can't be treated separately. The audio detection is optimized for detecting the same song in different qualities.
However, in scenarios like this I think I can use supplementary features to identify duplicates (e.g. fuzzy matching of the file names and ID3 tags). I'll do some tests and see how it goes...
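
As a rough illustration of what fuzzy file-name matching could look like (an assumption for discussion, not how the application does it), here is a sketch using Python's standard `difflib.SequenceMatcher`. The `normalize` and `name_similarity` helpers and the sample file names are hypothetical.

```python
import os
import re
from difflib import SequenceMatcher


def normalize(filename):
    """Lowercase the name, drop the extension, collapse punctuation to spaces."""
    stem, _ext = os.path.splitext(filename)
    return re.sub(r"[^a-z0-9]+", " ", stem.lower()).strip()


def name_similarity(name_a, name_b):
    """Return a 0.0-1.0 similarity ratio between two normalized file names."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


# Same title, different punctuation and file type.
print(name_similarity("Artist - Song (Live).mp3", "artist_song_live.flac"))
# An original and a cropped copy that differ by one extra word.
print(name_similarity("Artist - Song (Live).mp3", "Artist - Song (Live) cropped.mp3"))
```

A score like this would only be a secondary hint next to the audio comparison, since file names and tags can be wrong.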

I hope it will have sufficient options so the user can implement his/her own preferences. For example, if my preference is to start comparison at a similarity/tolerance of 60%, I should be able to do that without moving down from a higher pre-set level.
-sajman99 (November 16, 2009, 04:12 PM)

Yeah sure, that will be included. In my testing so far I concluded that a 90% or higher match will identify the same song at different bitrates/sample rates/other quality parameters. In some cases, however, there is only a ~70% match if the difference in encoding quality between two files of the same song is large (e.g. flac and low-bitrate wma). So definitely a must-have feature.

Likewise, in observing matches I would like a choice as to how they are grouped--something like "show all matches" or "show most relevant matches". Regardless of the presentation of matches, the software will still have to perform the same amount of work (if I understand correctly), but the more presentation options, the better!
-sajman99 (November 16, 2009, 04:12 PM)

I am with you on this; the main reason for this thread is that I wanted to hear from users how they would like search data to be presented. It would help if you could describe this in more detail; for example, did you mean by "show most relevant matches" that the program should show only matches that are 90% or higher?

Also, some type of (optional) cache management feature would be convenient so users don't have to start at ground zero every time they scan a large music collection.
-sajman99 (November 16, 2009, 04:12 PM)

That is already done

sajman99 · « **Reply #5 on:** November 18, 2009, 07:55 PM »

...One of the things I am currently brainstorming is how to present the search results to the user, considering that the same song can be found in two different files which are not exactly equal. That is, their audio data is not 100% identical but rather similar to some extent (e.g. 90% or more, due to different encoding options/noise in the background/quality of the recording; unlike regular duplicate files in DupeTrasher, which are completely equal, so grouping and presenting them to the user was a much easier task for me). So basically, let's say that songX can be found in the forms of fileA, fileB and fileC, where
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
The reason for this is the probabilistic nature of non-exact comparison (that is, I use percentages and similarity rather than "exactly equal" or "not equal" as in DupeTrasher). Now the question I have is, should I include all three files in the same duplicate group? And what if fileC is 90% similar to fileD, which is just 50% similar to fileA and fileB? I hope you can see the issue here.
-AsmDev (November 15, 2009, 04:47 PM)

Yeah, I can see that could get a little complicated to present graphically. I was attempting to comment on possible grouping issues which occur when there are multiple matches among songs A,B,C,D, etc. Some folks might prefer to see a large group with A and B at X%, A and C at X%, C and D at X% , A and D...--that gets unwieldy rather quickly. Other mathematically challenged folks like me would rather see smaller distinct groups.

Regardless, as long as you have more than one data view (ie. large group view, small group view, list view, etc.) I'm definitely interested in your software.

AsmDev · « **Reply #6 on:** November 22, 2009, 02:28 PM »

Hi sajman99 and other guys,

After several days of brainstorming, I think I've finally got this grouping problem solved in a way that is (more or less) satisfactory. What I did is group all files that can be grouped (that is, those sets of files that are all mutually duplicates); files that can't fit into any group are left out and presented in a non-grouped view. Here is a more detailed explanation:

Warning: some amount of mental work will be required to read this post; if you are just trying to relax by reading the DC forums, skip to the last paragraph

Let's say that the user performed a search with the similarity threshold set to 90% or more (meaning only files that are 90% or more similar will be considered duplicates) and let's say that 6 files are found with the following matches:

- file1 and file2 are 90% similar
- file1 and file3 are 92% similar
- file2 and file3 are 95% similar

- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar

And fileX and fileY are 0% similar, where X is a number {1,2,3} and Y is a letter {A,B,C}. In other words, no red file (fileA, fileB, fileC) is a duplicate of any blue file (file1, file2, file3), and vice versa.

Now in order to group these files, they need to be at least 90% similar (as set in the threshold option). Well, for file1, file2 and file3 this is not a problem: they are all at least 90% similar to each other (thus they are all mutually duplicates), so they can fit into one group. Fine, we have group#1, which I will call the blue group.

Now, what about the red group? Obviously not all 3 red files can fit into it, because fileB and fileC are only 70% similar (which is lower than the user-defined threshold). So I will have to leave one file out: I can group fileA and fileB (leaving fileC out) OR fileA and fileC (leaving fileB out). To the software it is arbitrary which files should be grouped, so let's say it groups fileA and fileB into group#2.

Now the result is that we have group#1 and group#2 in the group view + fileC left out which will be reported to the user as "the file that could not be grouped because it belongs to more than one group" and will be handled separately.
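
The grouping described above can be sketched roughly as follows. This is a greedy approximation written for illustration under stated assumptions, not the actual implementation: `SCORES`, `score` and `group_mutual_duplicates` are hypothetical names, the scores are the ones from this example, and the input is assumed to contain only files that have at least one match.

```python
# Pairwise similarity scores (percent) from the example in this post.
# Any pair not listed is treated as 0% similar.
SCORES = {
    frozenset(("file1", "file2")): 90,
    frozenset(("file1", "file3")): 92,
    frozenset(("file2", "file3")): 95,
    frozenset(("fileA", "fileB")): 90,
    frozenset(("fileA", "fileC")): 90,
    frozenset(("fileB", "fileC")): 70,
}


def score(a, b):
    return SCORES.get(frozenset((a, b)), 0)


def group_mutual_duplicates(files, threshold):
    """Put each file into the first group where it matches EVERY member."""
    groups = []
    for f in files:
        for group in groups:
            if all(score(f, member) >= threshold for member in group):
                group.append(f)
                break
        else:
            # No existing group accepts this file, so it starts its own.
            groups.append([f])
    grouped = [g for g in groups if len(g) > 1]
    left_out = [g[0] for g in groups if len(g) == 1]
    return grouped, left_out


files = ["file1", "file2", "file3", "fileA", "fileB", "fileC"]
grouped, left_out = group_mutual_duplicates(files, threshold=90)
print(grouped)   # groups shown in the group view
print(left_out)  # files handled separately
```

In this sketch the outcome depends on input order: if fileC is processed before fileB, the second group becomes fileA and fileC and fileB is the one left out, which mirrors the arbitrary choice mentioned above.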

Complicated? It probably sounds so, but I hope it will be clearer when I post some screenshots soon to illustrate these ideas (working on the GUI at the moment).
Well, this is the best solution I could figure out for this issue. If someone has actually understood all this mess I wrote and come up with a better solution, I'd love to hear it

At the end of the day, all these inner workings really don't matter to the end user as long as he/she gets a clear, instant idea of how to use the search results effectively once they are presented. And that is something I am trying to accomplish here: ease of use. Grouping duplicate audio files, though somewhat complicated to implement, is a very useful feature. It should group different files that all represent the same song, and this will greatly help the user figure out which dupes should be removed.
