Topic: Duplicate documents search with similarity/fuzzy logic?
MrCrispy
« on: June 03, 2008, 08:30:06 PM »

One of the projects I'm working on has a huge number of document files (HTML, RTF, TXT, Word), many of which contain duplicate content. They contain things like article abstracts, full content, and archived copies. So there are a lot of duplicates, but the files themselves are not identical. For example, 2 files may have the same content but a different HTML title, or 1 file may contain the text present in 3 others. It's all a big mess :(

The usual duplicate finders will not work. I'm not sure if there is any tool that can help me. I need something that will identify duplicate snippets by examining the file contents, then intelligently work out how those snippets map to duplicate files and which files can be deleted.

Any suggestions are welcome.
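
For what it's worth, the "1 file contains the text of 3 others" case can be sketched with word shingles (overlapping runs of k words) and a containment score. This is only a rough illustration in Python, not a finished tool: the function names `shingles` and `containment` and the default `k=5` are made up for this example, and it assumes the text has already been extracted from the HTML/RTF/Word files.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(a, b, k=5):
    """Fraction of a's shingles that also occur in b.

    1.0 means every k-word run of a also appears in b, so a is probably
    a snippet (abstract, excerpt) of b. The score is not symmetric.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)
```

With these definitions, an abstract copied verbatim into a longer article scores 1.0 against the article, while the article scores much lower against the abstract. That asymmetry is what hints at which file is the subset and might be a deletion candidate.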

PhilB66
« Reply #1 on: June 03, 2008, 08:45:19 PM »

TextDiff

WinMerge

KDiff3

MrCrispy
« Reply #2 on: June 04, 2008, 12:27:23 AM »

Thanks, but a diff/merge utility won't be enough, because it would need to check thousands of files against each other.
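
To put a number on that: n files give n(n-1)/2 pairs, so 5,000 files would already mean about 12.5 million diffs. One common way around it is to index the files by the shingles they contain and only look at pairs that share at least one. The sketch below is a hypothetical illustration of that idea (the names `shingle_set`, `candidate_pairs`, and `min_shared` are invented here), and it assumes the documents are already loaded as plain text.

```python
import re
from collections import defaultdict
from itertools import combinations

def shingle_set(text, k=4):
    """Set of k-word shingles, joined into strings so they can be dict keys."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def candidate_pairs(docs, k=4, min_shared=1):
    """docs maps a name to its text.

    Returns {(name_a, name_b): shared_shingle_count} for every pair of
    documents that have at least min_shared shingles in common.
    Pairs with nothing in common are never generated at all.
    """
    index = defaultdict(set)          # shingle -> names of docs containing it
    for name, text in docs.items():
        for sh in shingle_set(text, k):
            index[sh].add(name)

    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

Only the pairs this returns would then need a closer look with a diff tool. One caveat: a phrase that appears in nearly every file (a shared footer, say) puts all of those files in one bucket, so very common shingles may need to be skipped.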

steeladept
« Reply #3 on: June 04, 2008, 12:39:57 AM »

What if you used a script to feed the diff/merge utility? Assuming the utility provides some sort of automation, the script could take one file, compare it against each of the others, and let the utility decide which ones match. Once that pass is done, you would move the first file to a new location and start again with the next one. Well, something like that anyway. I don't know if any utilities provide that level of automation, however.
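
A script along those lines might look roughly like this in Python, using the standard library's `difflib` in place of an external diff utility. Treat it as a sketch under assumptions: the names `plain_text` and `similar_pairs`, the 0.9 threshold, and the file patterns are all made up for illustration, the tag stripping is deliberately crude, and it only reads text-like files (RTF and Word would need a converter first).

```python
import difflib
import itertools
import re
from pathlib import Path

def plain_text(raw):
    """Crudely strip HTML tags and collapse whitespace (not a real HTML parser)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(no_tags.split())

def similar_pairs(folder, threshold=0.9, patterns=("*.txt", "*.html")):
    """Yield (path_a, path_b, ratio) for file pairs whose text similarity
    is at least threshold, where ratio is difflib's 0.0 to 1.0 score."""
    paths = sorted(p for pat in patterns for p in Path(folder).glob(pat))
    texts = {p: plain_text(p.read_text(errors="ignore")) for p in paths}
    for a, b in itertools.combinations(paths, 2):
        ratio = difflib.SequenceMatcher(None, texts[a], texts[b]).ratio()
        if ratio >= threshold:
            yield a, b, ratio
```

Something like `for a, b, r in similar_pairs("docs"): print(a, b, round(r, 2))` would list the likely duplicates for review. It still compares every pair, and `SequenceMatcher` gets slow on long texts, so this is only practical for a modest number of files.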

Lashiec
« Reply #4 on: June 04, 2008, 10:16:21 AM »

Mmmm, jv16 PowerTools has a function that lets you specify, as a percentage, how similar you want the files to be. I don't know if it will work as it should, because the only time I tried it, it failed completely. I guess the most capable duplicate finders should have something like that as well.

Double Killer, for example :)
« Last Edit: June 04, 2008, 10:20:53 AM by Lashiec »