Messages - AsmDev

1
Hi sajman99 and other guys,

After several days of brainstorming, I think I've finally solved this grouping problem in a way that is (more or less) satisfactory. What I did is group all files that can be grouped (those sets of files that are all mutually duplicates), while files that can't fit into any group are left out and presented in a non-grouped view. Here is a more detailed explanation:

Warning: some amount of mental work will be required to read this post; if you are just trying to relax by reading DC forums, skip to the last paragraph :)

Let's say that the user performed a search with the similarity threshold set to 90% (meaning only files that are 90% or more similar will be considered duplicates) and that 6 files are found with the following matches:

- file1 and file2 are 90% similar
- file1 and file3 are 92% similar
- file2 and file3 are 95% similar

- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar

And fileX and fileY are 0% similar, where X is a number {1,2,3} and Y is a letter {A,B,C}. In other words, no red file is a duplicate of any blue file and vice versa.

Now, in order to be grouped, files need to be at least 90% similar to each other (as set in the threshold option). For file1, file2 and file3 this is not a problem: they are all at least 90% similar to each other (thus they are all mutually duplicates), so they can fit into one group. Fine, we have group#1, which I will call the blue group.

Now, what about the red group? Obviously not all 3 red files can fit into it, because fileB and fileC are only 70% similar (lower than the user-defined threshold). So I will have to leave one file out: I can group fileA and fileB (leaving fileC out) OR fileA and fileC (leaving fileB out). To the software it is arbitrary which files should be grouped, so let's say it groups fileA and fileB into group#2.

The result is that we have group#1 and group#2 in the group view, plus fileC left out, which will be reported to the user as "a file that could not be grouped because it belongs to more than one group" and handled separately.
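The grouping rule described above can be sketched in code. This is just a minimal illustration, not the app's actual implementation: the similarity table hard-codes the worked example, and a greedy pass admits a file into a group only if it is at least threshold-similar to every file already in that group.

```python
# Pairwise similarity scores from the worked example above (symmetric);
# unlisted pairs are treated as 0% similar.
SIMILARITY = {
    ("file1", "file2"): 0.90, ("file1", "file3"): 0.92, ("file2", "file3"): 0.95,
    ("fileA", "fileB"): 0.90, ("fileA", "fileC"): 0.90, ("fileB", "fileC"): 0.70,
}

def sim(a, b):
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

def group_duplicates(files, threshold=0.90):
    """Greedily build groups whose members are ALL mutually >= threshold;
    files that fit in no group of size 2+ are reported separately."""
    groups, ungrouped = [], []
    remaining = list(files)
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        for f in remaining[:]:
            # Admit f only if it matches EVERY current member of the group.
            if all(sim(f, g) >= threshold for g in group):
                group.append(f)
                remaining.remove(f)
        if len(group) > 1:
            groups.append(group)
        else:
            ungrouped.append(seed)
    return groups, ungrouped

groups, leftover = group_duplicates(
    ["file1", "file2", "file3", "fileA", "fileB", "fileC"])
# groups   -> [["file1", "file2", "file3"], ["fileA", "fileB"]]
# leftover -> ["fileC"]
```

Note that the greedy choice is exactly the arbitrariness mentioned above: seeded from fileA, the pass grabs fileB first, so fileC is the one left out.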

Complicated? It probably sounds so, but I hope it will be clearer when I post some screenshots soon to illustrate these ideas (I'm working on the GUI at the moment).
Well, this is the best solution I could figure out for this issue. If someone has actually understood all this mess I wrote and comes up with a better solution, I'd love to hear it :)


At the end of the day, all this inner working really doesn't matter to the end user, as long as he/she has a clear, instant idea of how to use the search results effectively once they are presented. And that is what I am trying to accomplish here - ease of use. Grouping duplicate audio files, though somewhat complicated to implement, is a very useful feature. It groups different files that all represent the same song, which will greatly help users figure out which dupes should be removed.

2
Curt, if I understand you correctly, you would like an option that adds new music files to your collection only if they are not already present there? That is something I also have in mind: many users have a sorted music collection and often need to add new songs, so an automated solution that checks whether they already have those songs (with additional info, like whether the existing file is better quality than the new one) would be useful.


It seems to me that it would be good to be able to easily differentiate between Exactly the Same & Similar - similar being where the percentages come in.

Also, maybe an indication of what 90% similar could mean - could that be a different file (e.g. different bitrate) but actually the same recording?
Ok thanks, this sounds reasonable. I will probably create several views as in DupeTrasher... some for exact dupes and others for similar ones, with information on how they differ.

Another example - this may be unusual - I have a small collection but relatively a lot of live tracks. I often crop them at either end if there's too much waffling & naturally keep the original recordings. Would the app be able to tell me they are exactly the same except one is cropped? (This is not really a feature request, but you did ask for scenarios!)

Well, in general it will be robust to silence at the beginning of the track, but I am not sure how it would handle stuff like this. Waffling and noise are part of the audio information and currently can't be treated separately. The audio detection is optimized for detecting the same song in different qualities.
However, in scenarios like this I think I can use supplementary features to identify duplicates (e.g. fuzzy matching of file names and ID3 tags). I'll do some tests and see how it goes...
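For what it's worth, the file-name side of that fuzzy matching can be sketched with Python's stdlib difflib (just an illustration: the file names are made up, and the real app may well use a different matcher):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy ratio between two file names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical names: the cropped live track keeps most of the original's
# name, so the ratio stays high even though the audio data differs at
# the ends - which is what lets name matching supplement audio matching.
full    = "Band - Live Track (full).mp3"
cropped = "Band - Live Track (cropped).mp3"
```

In the cropped-live-track scenario above, a high name ratio combined with a partial audio match could flag the pair as a likely duplicate.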

I hope it will have sufficient options so the user can implement his/her own preferences. For example, if my preference is to start comparison at a similarity/tolerance of 60%, I should be able to do that without moving down from a higher pre-set level.
Yeah, sure, that will be included. In my testing so far I've concluded that a 90% or higher match identifies the same song at different bitrates/samplerates/other quality parameters. In some cases, however, there is only a ~70% match when the encoding-quality gap between two files of the same song is large (e.g. FLAC vs. low-bitrate WMA). So definitely a must-have feature.

Likewise, in observing matches I would like a choice as to how they are grouped--something like "show all matches" or "show most relevant matches". Regardless of the presentation of matches, the software will still have to perform the same amount of work (if I understand correctly), but the more presentation options, the better!

I am with you on this; the main reason for this thread is that I wanted to hear from users how they would like search data to be presented. It would help if you could describe this in more detail - for example, did you mean by "show most relevant matches" that the program should show only matches that are 90% or higher?

Also, some type of (optional) cache management feature would be convenient so users don't have to start at ground zero every time they scan a large music collection.

That is already done  :Thmbsup:

3
Hello DC Users,

For me as a developer, I feel it is always a good thing to ask users what they would like to see in an application before it is even released. So I am starting this thread, similar to the one I did for DupeTrasher earlier this year here.

I am in the process of developing a new piece of software similar to my DupeTrasher duplicate file finder, but this time the new app will be specialized for audio files only. Similar to duplicate photo finders, this software should be able to find duplicate and similar audio files by comparing the audio data in them (basically listening to how they sound). In addition, it will use other parameters for duplicate detection, like ID3 tags, file name and binary content, but those will be of secondary importance, because that file data can be wrong and I want detection to be almost independent of it. With "audio listening", it should be able to recognize dupes even if they have different names, tags, bitrates, sample rates and file types (mp3/wma/ogg/flac...). It is designed specifically to resist lossy encodings, but in some cases it should even be able to detect different performances of the same song (e.g. live/album/remix).
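As a rough illustration of where a "percent similar" number can come from (a toy sketch, not the actual detection algorithm - the fingerprint values below are made up), one common scheme reduces each file's audio to a compact bit fingerprint and scores similarity as the fraction of matching bits:

```python
def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    """Fraction of matching bits between two integer fingerprints."""
    # XOR leaves a 1 bit wherever the fingerprints disagree.
    differing = bin((fp_a ^ fp_b) & ((1 << bits) - 1)).count("1")
    return (bits - differing) / bits

# Hypothetical fingerprints: a re-encode of the same song flips only
# a few bits, so the score stays high.
original  = 0b1011_0010_1110_0001
reencoded = original ^ 0b11          # 2 of the 64 bits differ
# similarity(original, reencoded) -> 0.96875
```

Under a scheme like this, a different file name or tag has no effect on the score at all, since only the audio-derived fingerprint is compared.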

Of course, this is just the first half of the problem, which I have almost completed. The second, equally important half is presenting information to the user so that he can decide which files should be removed in the shortest amount of time and with the least effort.

So feel free to post your general suggestions and requests (if you have any and if you are interested in software like this). If you have common scenarios where duplicate audio files are involved, let me know so that I can analyze them and find a solution, in the form of a feature that can make your life easier. Your ideas on graphic design and window layouts are also welcome.

One of the things I am brainstorming right now is how to present the search results to the user, considering that the same song can be found in two different files which are not exactly equal. That is, their audio data is not 100% identical but rather similar to some extent (e.g. 90% or more, due to different encoding options/noise in the background/quality of the recording; unlike regular duplicate files in DupeTrasher, which are completely equal, so grouping and presenting them to the user was a much easier task). So basically, let's say that songX can be found in the form of fileA, fileB and fileC, where
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
The reason for this is the probabilistic nature of non-exact comparison (I use percentages and similarity rather than "exactly equal" or "not equal" as in DupeTrasher). Now the question is: should I include all three files in the same duplicate group? And what if fileC is 90% similar to a fileD which is just 50% similar to fileA and fileB? I hope you can see the issue here.
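The underlying problem is that "is a duplicate of" at some threshold is not transitive, so the naive fix of merging everything connected by a chain of matches over-groups. A small sketch (with hypothetical scores, including the fileD case from the question) of that naive connected-components approach shows the failure:

```python
def connected_groups(pairs, threshold):
    """Group files into connected components of the >= threshold graph
    (naive union-find). This deliberately demonstrates the WRONG rule:
    chaining matches transitively pulls in weakly-similar files."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (a, b), score in pairs.items():
        ra, rb = find(a), find(b)
        if score >= threshold and ra != rb:
            parent[ra] = rb                  # merge the two components
    groups = {}
    for f in list(parent):
        groups.setdefault(find(f), set()).add(f)
    return list(groups.values())

# Hypothetical scores matching the fileB/fileC/fileD question above.
pairs = {("fileA", "fileB"): 0.90, ("fileA", "fileC"): 0.90,
         ("fileB", "fileC"): 0.70, ("fileC", "fileD"): 0.90,
         ("fileA", "fileD"): 0.50, ("fileB", "fileD"): 0.50}

groups = connected_groups(pairs, 0.90)
# -> one group {fileA, fileB, fileC, fileD}, even though fileD is only
#    50% similar to fileA and fileB.
```

Requiring group members to be *mutually* above the threshold avoids this chaining, at the cost of some files not fitting into any single group.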

The easiest way for me is to present a list of all files that have at least one duplicate and then let the user mark the files he is interested in for deletion. So there would be no grouping of any kind, just a plain list. But I don't think this is a very coherent way of presenting the results of a duplicates search (I will probably include it as just one of the views, though).


Of course, as always, DC users will get special treatment from me. Besides discounts, I will provide free licenses for the monthly newsletter, beta testers and other contributors.

Thanks! :)

4
When I first read about DupeTrasher, I swore not to try it! The reason is very simple and can be illustrated with a little screenshot from my True_Launchbar:
 (see attachment in previous post)
- and, yes, they are all fully licensed $hareware programs - edit except for the 'Nuker.

But why do I keep on purchasing these kinds of programs? In fact, I didn't quite know the answer myself until I read patteo's post (Reply #18). What I am missing is speed (and ease) in the USE of the program (ANY program)! I acknowledge that DupeTrasher is a very fast scanner, but the examples given by patteo go to show how to make the final result come out even faster yet.

Personally I have no idea how to write "genuine" command-line parameters, so I am bound to also ask for pre-written examples. For example, if AsmDev is reading this and maybe will even try out the free Quizo QTTabBar with Vista's Explorer, then Vista+Quizo lets me add simple arguments like %f% and such (see next screenshot) to a shortcut to any program. I am, however, not convinced that these arguments are universally accepted(?). At least, I think I have seen others use quite different arguments.
 (see attachment in previous post)


Hi Curt,

I will definitely try the software you suggested in order to gain more insight into the problems you have. I think I will be able to develop a solution that enables guys like patteo and you to automate common tasks and get a complex job done in 2-3 clicks.

It would be helpful if you could tell me what the common scenarios you want to automate look like. For example, do you have a folder of master duplicates that you regularly compare against other folders, wanting to remove from those folders any files that are found in the master folder?
Or when you plug in your USB flash would you like to have it compared against files you already have on your hard drive?

These two are just from the top of my head of what common scenarios people have when they want to get rid of dupes quickly.

5
Hi clmcveigh,

Yes I noticed a 30% order and I could link your name to your nick here. Thanks!

As for default search locations, you are right that only hard drives are set for search by default, but it is not a problem to add other locations for the next version. I just haven't had any complaints about it so far, as users usually search only the hard drives. On the other hand, it is also really easy to add any other location just by drag and drop (in the custom search mode).

Anyway, I am putting this on the TODO list, so let me ask you a question regarding it: would you suggest that all available drives be checked for search by default?
The same question applies to the "One click search" feature - do you think all drives should be included, or should it be left as it is now (only fixed drives)?
Note that network drives could make the whole process much, much slower, which might annoy a new user on first impression.
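That fixed-drives-by-default behaviour could be implemented along these lines. This is a sketch, not the app's actual code: `pick_default_roots` and the example drive map are hypothetical, though the numeric codes are the real ones returned by the Win32 `GetDriveTypeW` API.

```python
# Win32 GetDriveTypeW codes (defined by the Windows API):
#   2 = removable, 3 = fixed, 4 = remote/network, 5 = cdrom, 6 = ramdisk
DRIVE_FIXED, DRIVE_REMOTE = 3, 4

def pick_default_roots(drive_types, include_network=False):
    """drive_types maps root -> drive-type code. Fixed drives are always
    searched by default; network drives only on opt-in, since scanning
    them can make a first run painfully slow."""
    wanted = {DRIVE_FIXED} | ({DRIVE_REMOTE} if include_network else set())
    return sorted(root for root, t in drive_types.items() if t in wanted)

# On Windows the codes would come from kernel32.GetDriveTypeW(root);
# these example values are made up for illustration.
example = {"C:\\": 3, "D:\\": 3, "E:\\": 5, "Z:\\": 4}
# pick_default_roots(example) -> ["C:\\", "D:\\"]
```

Exposing `include_network` as a checkbox would let power users opt in without slowing down the default one-click experience.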

It sure did find the dups on my hard-drive in no time flat. Will have to spend time learning how best to use all this program's functionality.

Well, I was trying to implement a "don't make me think" approach for users as much as possible, so hopefully you won't have to spend much time learning it but rather just "use it". :) I personally hate having to learn how to use any "small" piece of software. Once it's installed, I'd just like to do "less thinking and more doing" to get my job done.

Thanks


