Messages - AsmDev
1
Hi sajman99 and other guys,

After several days of brainstorming, I think I've finally got this grouping problem solved in a way that is (more or less) satisfactory. What I did is group all files that can be grouped (that is, sets of files that are all mutual duplicates); files that can't fit into any group are simply left out and presented in a non-grouped view. Here is a more detailed explanation:

Warning: some amount of mental work will be required to read this post; if you are just trying to relax by reading the DC forums, skip to the last paragraph :)

Let's say that the user performed a search with the similarity threshold set to 90% (meaning only files that are 90% or more similar will be considered duplicates) and let's say that 6 files are found with the following matches:

- file1 and file2 are 90% similar
- file1 and file3 are 92% similar
- file2 and file3 are 95% similar

- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar

And fileX and fileY are 0% similar, where X is a number {1,2,3} and Y is a letter {A,B,C}. In other words, no red file (fileA, fileB, fileC) is a duplicate of any blue file (file1, file2, file3) and vice versa.

Now, in order to group these files, they need to be at least 90% similar (as set in the threshold option). Well, for file1, file2 and file3 this is not a problem: they are all at least 90% similar to each other (thus they are all mutual duplicates), so they can fit into one group. Fine, we have group#1, which I will call the blue group.

Now, what about the red group? Obviously not all 3 red files can fit into it, because fileB and fileC are only 70% similar (which is lower than the user-defined threshold). So I will have to leave one file out: I can group fileA and fileB (leaving fileC out) OR fileA and fileC (leaving fileB out). Well, to the software it is arbitrary which files should be grouped, so let's say it groups fileA and fileB into group#2.

Now the result is that we have group#1 and group#2 in the group view + fileC left out which will be reported to the user as "the file that could not be grouped because it belongs to more than one group" and will be handled separately.
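To make the walkthrough above concrete, here is a minimal Python sketch of one way such grouping could be done. It is only an illustration, not the actual code of the application: the function name `group_mutual_duplicates`, the dictionary of pairwise scores and the greedy "first group that fits" strategy are all assumptions made for this example.

```python
def group_mutual_duplicates(files, similarity, threshold=0.90):
    """Greedy sketch: a file joins a group only if it is at least
    `threshold` similar to EVERY file already in that group."""

    def sim(a, b):
        # Scores are stored once per pair; a missing pair counts as 0% similar.
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    groups = []
    for f in files:
        for g in groups:
            if all(sim(f, member) >= threshold for member in g):
                g.append(f)
                break
        else:
            # No existing group accepts this file, so it starts its own.
            groups.append([f])

    grouped = [g for g in groups if len(g) > 1]
    left_out = [g[0] for g in groups if len(g) == 1]
    return grouped, left_out


# The six files from the example above (hypothetical scores as fractions).
similarity = {
    ("file1", "file2"): 0.90,
    ("file1", "file3"): 0.92,
    ("file2", "file3"): 0.95,
    ("fileA", "fileB"): 0.90,
    ("fileA", "fileC"): 0.90,
    ("fileB", "fileC"): 0.70,
}
files = ["file1", "file2", "file3", "fileA", "fileB", "fileC"]

groups, left_out = group_mutual_duplicates(files, similarity)
print(groups)    # [['file1', 'file2', 'file3'], ['fileA', 'fileB']]
print(left_out)  # ['fileC']
```

With the files processed in this order the sketch reproduces the result described above: the blue group, a red group of fileA and fileB, and fileC left out. Processing fileC before fileB would give the other, equally valid, red group, which is the arbitrariness mentioned earlier.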

Complicated? It probably sounds so, but I hope it will be clearer when I post some screenshots soon to illustrate these ideas (working on the GUI at the moment).
Well, this is the best solution I could figure out for this issue. If someone has actually understood all this mess I wrote and then come up with a better solution, I'd love to hear it :)


At the end of the day, all these inner workings really don't matter to the end user as long as he/she has a clear, instant idea of how to use the search results effectively once they are presented. And that is something I am trying to accomplish here: ease of use. Grouping duplicate audio files, though somewhat complicated to implement, is a very useful feature. It should group different files that all represent the same song, and this will greatly help the user figure out which dupes should be removed.

2
Curt, if I understand you correctly, you would like to have an option that adds new music files to your collection only if they are not already present there? That is something I also have in mind, since many users have a sorted music collection and often need to add new songs, so an automated solution that checks whether they already have those songs (with additional info, like whether the existing copy is of better quality than the new file) would be useful.


It seems to me that it would be good to be able to easily differentiate between Exactly the Same & Similar - similar being where the percentages come in

Also, maybe an indication of what 90% similar could mean - could that be a different file (e.g. diff bitrate) but actually the same recording.
Ok thanks, this sounds reasonable. I will probably create several views as in DupeTrasher... some for exact dupes and similar ones with information on how they differ.

Another example - this may be unusual - I have a small collection but relatively a lot of live tracks - I often crop them at either end if there's too much waffling & naturally keep the original recordings. Would the app be able to tell me they are exactly the same except one is cropped (this is not really a feature request but you did ask for scenarios!)

Well, in general it will be resistant to silence at the beginning of the track, but I am not sure how it would handle stuff like this. Waffling and noise are part of the audio information and currently can't be treated separately. The audio detection is optimized for detecting the same song in different qualities.
However, in scenarios like this I think I can use supplementary features to identify duplicates (e.g. fuzzy matching of file names and ID3 tags). I'll do some tests and see how it goes...
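As a rough illustration of what fuzzy matching of file names could look like, here is a small Python sketch based on the standard library's `difflib.SequenceMatcher`. The helper name `name_similarity` and the example file names are made up for this post; this is one possible approach, not the method the application actually uses.

```python
from difflib import SequenceMatcher
from pathlib import PurePath


def name_similarity(path_a, path_b):
    """Return a 0.0-1.0 score for how alike two file names are,
    ignoring the folder, the extension and letter case."""
    stem_a = PurePath(path_a).stem.lower()
    stem_b = PurePath(path_b).stem.lower()
    return SequenceMatcher(None, stem_a, stem_b).ratio()


# Hypothetical file names: an original live track and its cropped copy.
score = name_similarity("live/Concert_Intro.mp3", "cropped/concert_intro_cropped.mp3")
print(f"name similarity: {score:.2f}")
```

A score like this could then be combined with the audio similarity, for example to flag a cropped recording whose audio match falls below the threshold but whose name is nearly identical to another file.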

I hope it will have sufficient options so the user can implement his/her own preferences. For example, if my preference is to start comparison at a similarity/tolerance of 60%, I should be able to do that without moving down from a higher pre-set level.
Yeah sure, that will be included. In my testing so far I concluded that a match of 90% or more will identify the same song at a different bitrate/sample rate/other quality parameters. In some cases, however, there is only a ~70% match if the difference in encoding quality between two files of the same song is large (e.g. flac and low-bitrate wma). So definitely a must-have feature.

Likewise, in observing matches I would like a choice as to how they are grouped--something like "show all matches" or "show most relevant matches". Regardless of the presentation of matches, the software will still have to perform the same amount of work (if I understand correctly), but the more presentation options, the better!

I am with you on this; the main reason for this thread is that I wanted to hear from users how they would like search data to be presented. It would help if you could describe this in more detail. For example, did you mean by "show most relevant matches" that the program should show only matches that are 90% or higher?

Also, some type of (optional) cache management feature would be convenient so users don't have to start at ground zero every time they scan a large music collection.

That is already done  :Thmbsup:

3
Hello DC Users,

As a developer, I feel it is always a good thing to ask users what they would like to see in an application before it is even released. So I am starting this thread, similar to the one I did for DupeTrasher earlier this year here.

I am in the process of developing a new piece of software similar to my DupeTrasher duplicate file finder, but this time the new app will be specialized for audio files only. Similarly to duplicate photo finders, this software should be able to find duplicate and similar audio files by comparing the audio data in them (basically listening to how they sound). In addition, it will use other parameters for duplicate detection like ID3 tags, file name and binary content, but those will be of secondary importance because that file data can be wrong and I want detection to be almost independent of them. With "audio listening", it should be able to recognize dupes even if they have different names, tags, bitrates, sample rates and file types (mp3/wma/ogg/flac...). It is designed specifically to resist lossy encodings, but in some cases it should even be able to detect different performances of the same song (e.g. live/album/remix).

Of course, this is just the first half of the problem, which I have almost completed. The second and equally important half is presenting information to the user so that he can decide which files should be removed in the shortest amount of time and with the least effort.

So feel free to post your general suggestions and requests (if you have any and if you are interested in software like this). If you have some common scenarios, where duplicate audio files are involved let me know so that I can analyze and find a solution for it, in the form of a feature that can make your life easier. Your ideas on graphic design and window layouts are also welcome.

One of the things I am currently brainstorming is how to present the search results to the user, considering that the same song can be found in two different files which are not exactly equal. That is, their audio data is not 100% identical but rather similar to some extent (e.g. 90% or more due to different encoding options/noise in the background/quality of the recording; unlike regular duplicate files in DupeTrasher, which are completely equal, so grouping and presenting them to the user was a much easier task for me). So basically, let's say that songX can be found in the form of fileA, fileB and fileC, where
- fileA and fileB are 90% similar
- fileA and fileC are 90% similar
- but fileB and fileC are only 70% similar
The reason for this is the probabilistic nature of non-exact comparison (that is, I use percentages and similarity rather than "exactly equal" or "not equal" as in DupeTrasher). Now the question I have is, should I include all three files in the same duplicate group? And what if fileC is 90% similar to fileD, which is just 50% similar to fileA and fileB? I hope you can see the issue here.
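To show the issue in code, here is a small Python sketch of the "naive" way of grouping, where any chain of 90% matches puts files into the same group (connected components via a simple union-find). The function name `chain_groups` and the scores are assumptions taken from the example above, purely for illustration.

```python
def chain_groups(files, similarity, threshold=0.90):
    """Naive grouping: two files share a group if ANY chain of
    matches at or above `threshold` connects them."""
    parent = {f: f for f in files}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return [g for g in groups.values() if len(g) > 1]


# Scores from the example: fileD matches only fileC well.
similarity = {
    ("fileA", "fileB"): 0.90,
    ("fileA", "fileC"): 0.90,
    ("fileB", "fileC"): 0.70,
    ("fileC", "fileD"): 0.90,
    ("fileA", "fileD"): 0.50,
    ("fileB", "fileD"): 0.50,
}
files = ["fileA", "fileB", "fileC", "fileD"]

print(chain_groups(files, similarity))
# [['fileA', 'fileB', 'fileC', 'fileD']]
```

All four files end up in one group even though fileD is only 50% similar to fileA and fileB, which is exactly why chaining matches together is questionable here.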

The easiest way for me is to present a list of all files that have at least one duplicate and then let the user mark for deletion the files he is interested in. So there would be no grouping of any kind, just a plain list. But I don't think this is a very coherent way of presenting the results of a duplicate search (I will probably include it as just one of the views, though).


Of course, as always, DC users will get special treatment from me. Besides discounts, I will provide free licenses for the monthly newsletter, beta testers and other contributors.

Thanks! :)

4
When I first read about DupeTrasher, I swore not to try it! The reason is very simple and can be illustrated with a little screenshot from my True_Launchbar:
 (see attachment in previous post)
- and, yes, they are all fully licensed $hareware programs - edit except for the 'Nuker.

But why do I keep on purchasing this kind of program? In fact, I didn't quite know the answer myself until I read patteo's post (Reply #18). What I am missing is speed (and ease) in the USE of the program (ANY program)! I acknowledge that DupeTrasher is a very fast scanner, but the examples given by patteo go to show how to make the final result come out even faster yet.

Personally I have no idea how to write "genuine" command-line parameters, so I am bound to also ask for pre-written examples. As an example, if AsmDev is reading this and maybe even will try out the free Quizo QTTabbar with Vista's Explorer, then Vista+Quizo lets me add simple arguments like %f% and such (see next screenshot) to a shortcut to any program. I am however not convinced that these arguments are universally accepted(?). At least, I think I have seen others use quite different arguments.
 (see attachment in previous post)


Hi Curt,

I will definitely try the software you suggested in order to gain more insight into the problems you have. I think I will be able to develop a solution that will enable guys like patteo and you to automate common tasks and get a complex job done in 2-3 clicks.

It would be helpful if you could tell me what the common scenarios that you want to automate look like. For example, do you have a folder with master copies that you regularly compare against some other folders, and want to remove from those folders the files that are found in the master folder?
Or when you plug in your USB flash drive, would you like to have it compared against files you already have on your hard drive?

These two are just off the top of my head as common scenarios people have when they want to get rid of dupes quickly.

5
Hi clmcveigh,

Yes I noticed a 30% order and I could link your name to your nick here. Thanks!

As for default search locations, you are right, only hard drives are set for search by default, but it is not a problem to add other locations in the next version. I just haven't had any complaints about it so far, as users usually search only the hard drives. On the other hand, it is also really easy to add any other location just by "drag and drop" (in the custom search mode).

Anyway, I am putting this on the TODO list, so let me ask you a question regarding this: would you suggest that all available drives should be checked for the search by default?
This question applies to the "One click search" feature too: do you think all drives should be included, or should I just leave it as it is now (only fixed drives)?
Note that network drives could make the whole process much, much slower, so this might annoy a new user and spoil the first impression.

It sure did find the dups on my hard-drive in no time flat. Will have to spend time learning how best to use all this program's functionality.

Well, I was trying to implement a "don't make me think" approach for the users as much as possible, so hopefully you won't have to spend much time on learning but rather just "use it". :) I personally hate having to learn how to use any "small" software. Once installed, I'd just like to do "less thinking and more doing" in order to get my job done.

Thanks



6
Hi guys,

I've just extended the end date for this 40% discount coupon as mouser and I agreed previously for this newsletter.
Here is the order link with the coupon code embedded so that you don't have to hassle with manual typing:
https://secure.shareit.com/shareit/checkout.html?PRODUCT[300333594]=1&COUPON1=DT2009-DC2W

(you will have to copy/paste it to web browser since this forum's link tag can't seem to handle this link correctly, probably because of the brackets in it... mouser, you might wanna check this out in your forum code)


Please excuse this issue; I thought that the DonationCoder newsletter would be released on October 1st and wasn't expecting it a few days earlier. Also, it appears that the e-commerce server is located in Europe, so 27th Sept. was adjusted to that time zone.

Thank you all for your support!

7
patteo, thanks for the detailed description. I will probably include this feature in the next version.

8
Darwin and Shades, no problem, thank you for your feedback and testing!

Lutz, I am glad you liked the graphics, since I am not a professional designer but I tried my best.  :)

patteo, right now DupeTrasher doesn't have any command line features, but I can implement that in a future version. What exactly would you like to have with this command line support? I haven't checked that feature in other duplicate scanners (will do that as soon as I get some time), so your feedback would be helpful as a starting point.

Innuendo, the name is DupeTrasher, not DupeThrasher as some guys here mistyped, but I don't mind as long as we understand each other ;)

9
Huh I can't believe I missed those typos... I guess sometimes developers are occupied too much with looking for functional bugs rather than grammatical  :huh: Thanks for that!

Since you liked it and you also found these typos, please send me a private message with your full name and your email address, as I'd like to give you a free license for this.

Thanks again!

10
Hi Darwin,

What are your impressions, did you find it useful? I'd love to hear some feedback.

Thanks

11
Hello,

I'd just like to let you know that I have finally released DupeTrasher. You can check it out on www.dupetrasher.com

Thanks go to everyone for their suggestions, and to those who helped me beta test it I will send a free license this weekend. Other members of this community who are interested in this program can get it with the 40% discount. I've posted details and the coupon code in this new topic

https://www.donation...ex.php?topic=19901.0

I hope this will become your new favorite duplicate file finder :). Thanks again for your posts!


12
DupeTrasher is an application that finds and removes duplicate files and folders from your computer. In addition, DupeTrasher can also look inside archives and find locations where they have been extracted.

All these features, packed in a user-friendly interface, make the whole process of removing duplicates: easy, fast and safe.

DupeTrasher's Website: http://www.dupetrasher.com/


Note for DonationCoder users: You can get DupeTrasher 2009 with the discount of 40% if you enter coupon code DT2009-DC2W on the order page.
This is a time-limited discount, available for the first two weeks from the release of DupeTrasher 2009 (until 28 September 2009). This is my gift to this wonderful community, and next month I will provide some free licenses in agreement with mouser.

Best regards!

13
Hi Taqxim,
I didn't completely understand your idea, but here is what I have already done:
dupetree.PNG

In this feature I have a tree hierarchy of all folders where at least one duplicate file is found. When you click on a folder you can see all duplicate files in it. Note that regular files are not listed. This way you can mark for removal all duplicate files located in one folder (and its subfolders) by clicking the checkbox. Let me know if this somewhat aligns with your idea. Thanks!

15
Hello everyone,

I've been working on this software for the last two months and now it is almost done. I reviewed your requests and most of them have been implemented. I've also added support for archive files (.zip and .rar for now, as they are the most common on the internet) and here are some new screenshots:

Duplicate files in archives are presented in several ways, and one of them is the following: each archive file that has duplicates outside itself is added to this list. You can expand/collapse to see them and mark the ones you would like to remove:
http://img142.imageshack.us/img142/2497/archiveswithdupes.png


Also, there is a feature that will find archives with same content regardless of their name or extension. For example here are two archives with different name and type but with same content:
http://img142.imageshack.us/img142/3051/archiveswithsamecontent.png
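For readers curious how "same content regardless of name" can be checked in principle, here is a minimal Python sketch. It only handles .zip files through the standard `zipfile` module (.rar would need a third-party library), and the function name `zip_fingerprint` is an assumption for this example; it is not how DupeTrasher is implemented.

```python
import hashlib
import zipfile


def zip_fingerprint(path):
    """Return a fingerprint of a zip archive's content: the set of
    (member name, SHA-256 of the uncompressed data) pairs.
    The archive's own name, extension and compression level are ignored."""
    entries = set()
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # folders carry no data of their own
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            entries.add((info.filename, digest))
    return frozenset(entries)


def same_content(path_a, path_b):
    """True if both archives hold the same files with the same data."""
    return zip_fingerprint(path_a) == zip_fingerprint(path_b)
```

Because the hashes are computed on the uncompressed data, the same idea would carry over to comparing a .zip against a .rar, given a reader for the second format.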


And last but not least is the feature that will find all folders where archives have been extracted. This happens often: you download some .zip from the internet and extract it somewhere, but later you forget about it completely. Each archive in the list has a list of folders attached to it where it has been extracted:
http://img142.imageshack.us/img142/4908/archiveswithextractedfo.png


If you still have some ideas feel free to let me know.

By the way, I am looking for BETA testers, to whom I will be giving a free license after this software is released, so if you'd like to contribute I'd really appreciate that. Please note that I need users who really need this kind of software and who would really like to give it a real-world test, and not just guys looking for a free license. Those people should not worry either, as I will be giving free licenses and discounts to this community when this project is done.

So send me a PM if you are interested, with your e-mail and the operating system you use, so that I can send you instructions and a test build.

Thanks

16
Hey guys,

Darwin, all your ideas have already been implemented. Here is a screenshot of the current stage:

http://img68.imageshack.us/img68/2169/duplicate1oo6.png


Kamel, I'll let you know once beta version is ready for testing, thanks!

1. Possibility of detailed selection of drives and folders to scan/watch  - Implemented
2. Possibility of several "levels" of similarity between files. For example: a) exactly the same, b) same name, different size, c) similar name, same size, same date, and so on - Implemented a) and b), as for c) it will be left for future version as I need to tune the algorithm for detecting similarities between file names
3. Ability to set that the older files in a pair/group are the ones marked for deletion by default - Implemented
4. Ability to select only certain file types to be scanned (i.e. only music files, only video files, etc.) - Implemented
Btw, about point 3 here, I was thinking more about deleting all files in a group and leaving just the oldest by default, since it is most probable that the oldest file is the original and the newer files were copied later on. But in any case I think I will add an option for the user to select which file is to be considered the original: "the oldest" or "the newest".
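The "keep the oldest (or newest) by default" rule can be sketched in a few lines of Python. The function name `mark_for_deletion`, the `keep` parameter and the sample paths and timestamps are all hypothetical, chosen only to illustrate the option described above.

```python
def mark_for_deletion(group, keep="oldest"):
    """group: list of (path, modification_time) tuples for one duplicate group.
    Returns (path_to_keep, paths_marked_for_deletion)."""
    if keep not in ("oldest", "newest"):
        raise ValueError("keep must be 'oldest' or 'newest'")
    chooser = min if keep == "oldest" else max
    original = chooser(group, key=lambda item: item[1])[0]
    marked = [path for path, _mtime in group if path != original]
    return original, marked


# Hypothetical duplicate group with modification times as Unix timestamps.
group = [
    ("D:/backup/photo.jpg", 1230768000),
    ("C:/pics/photo.jpg", 1199145600),
    ("C:/temp/photo_copy.jpg", 1246406400),
]

print(mark_for_deletion(group))
# ('C:/pics/photo.jpg', ['D:/backup/photo.jpg', 'C:/temp/photo_copy.jpg'])
print(mark_for_deletion(group, keep="newest"))
# ('C:/temp/photo_copy.jpg', ['D:/backup/photo.jpg', 'C:/pics/photo.jpg'])
```

The result is only a default selection; the user would still review it before anything is deleted.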


city_zen's suggestions are all good.
i also always like various options for setting how the program should select the default one to keep.
So beside city_zen's suggestions do you have any other ideas on how to help the software to best determine what to select?
Btw there will be option to select all dupe files in specific folder, here is the screenshot:

http://img146.imageshack.us/img146/1298/duplicate2nd5.png



Here you can click the checkbox next to each folder and all duplicate files (not regular files) will be automatically selected.

- Possibility of several "levels" of similarity between files. For example: a) exactly the same, b) same name, different size, c) similar name, same size, same date, and so on
Add to that, same root name but different extension, so you can flag e.g. ZIP and RAR archives that are really the same.

So you are basically saying the app should be able to detect 2 different archives with the same content? OK, that's an interesting idea, I will see what I can do about it.


Curt, if I understand right, that photo is from an app for duplicate pictures, right? Well, that kind of software is somewhat different from regular duplicate file finders, because photos with completely different names and binary content (e.g. jpg vs. bmp) can be the same.
Also about columns, you said you'd like to see two columns, but what if there is a group of duplicates that has 3 or more identical files? Well, in that case I think a large number of columns would make the view harder to read.
Sorting is implemented of course, that is a must in any data handling software today.

17
First of all let me say hello to all users in this great community of which I am shamefully not a very active member. Some of you I already know as customers from the support mail exchange and there are also other guys like f0dder, who I know from other software developing forums.

My name is Milos and I am the author of DupeTrasher (http://www.asmdev.ne...products/dupetrasher).

I would like to announce the development of the new version of this application, which will, I hope, become the first name people associate with software for removing duplicate files. The first version had some success but there is still much room for improvement. I carefully noted user requests, and along with some new ideas of mine and with new Vista technology it is almost done.
I will be giving some free license codes to people in this community and also discounts once it is released.  :Thmbsup:

I'd like to ask you guys what common scenarios you have when dealing with duplicate files (besides searching all drives). For example, one feature that I implemented is to find all duplicate files on the hard drive which are already available on a CD/DVD. After the search is done, all dupes located on the hard drive will be automatically marked for removal. Then the user can review the selection and proceed with deleting if needed. This saves the user's time and does the job with minimal effort.
So this is just one common scenario that crossed my mind; do you have other similar ones? I will gladly implement them if they seem to be useful.
Also feel free to suggest any other feature you think would be useful in an application like this.
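To give an idea of what the CD/DVD scenario involves, here is a minimal Python sketch that compares files by the SHA-256 hash of their content. The function names and the `master_root`/`target_root` parameters are assumptions for illustration only; this is a simplified sketch, not DupeTrasher's actual search code.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's content, read in chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def iter_files(root):
    """Yield the full path of every file under `root`, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def mark_copies_of_master(master_root, target_root):
    """Return the files under `target_root` (e.g. a hard drive folder) whose
    content also exists under `master_root` (e.g. a CD/DVD).
    Nothing is deleted; the list is only a selection for the user to review."""
    master_digests = {file_digest(p) for p in iter_files(master_root)}
    return sorted(p for p in iter_files(target_root)
                  if file_digest(p) in master_digests)
```

A real scanner would typically compare file sizes first and hash only the candidates, since hashing every file is slow on large drives.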

Thanks
