Author Topic: Duplicate documents search with similarity/fuzzy logic ? (Read 4186 times)

MrCrispy · « **on:** June 03, 2008, 08:30 PM »

One of the projects I'm working on has a huge number of document files (HTML, RTF, TXT, Word), many of which contain duplicate content. They contain things like article abstracts, full content and archived copies. So there are a lot of duplicates, but the files themselves are not identical. For example, 2 files may have the same content but a different HTML title, or 1 file may contain the text present in 3 others, etc. It's all a big mess.

Usual duplicate finders will not work. I'm not sure if there is any tool that can help me - I need something which will identify duplicate snippets by examining the file contents, and then intelligently recognize how it maps to duplicate files, and which ones can be deleted.

Any suggestions are welcome.
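The "1 file which contains the text present in 3 others" case is roughly what word-shingle containment measures: the share of one document's overlapping word runs that also appear in another. The following is only a minimal sketch of that idea, not a tool anyone in this thread recommended. The names `shingles` and `containment` and the default `k=5` are hypothetical, and it assumes plain text has already been extracted from the HTML/RTF/Word files.

```python
import re

def shingles(text, k=5):
    """Return the set of overlapping k-word runs in text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one run.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(a, b, k=5):
    """Fraction of a's shingles that also occur in b (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)

abstract = "the quick brown fox jumps over the lazy dog"
full = ("Intro text here. The quick brown fox jumps over the lazy dog. "
        "And more text follows after.")
print(containment(abstract, full, k=3))  # abstract is fully inside full
print(containment(full, abstract, k=3))  # full has extra text of its own
```

A high score in one direction and a low one in the other suggests the first file is a snippet of the second, which is a hint (not proof) that the shorter file is the one that could be deleted.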

PhilB66 · « **Reply #1 on:** June 03, 2008, 08:45 PM »

TextDiff

WinMerge

KDiff3

MrCrispy · « **Reply #2 on:** June 04, 2008, 12:27 AM »

Thanks, but a diff/merge utility won't be enough, because it would need to check thousands of files against each other.

steeladept · « **Reply #3 on:** June 04, 2008, 12:39 AM »

What if you used a script to feed the diff/merge utility? Assuming the utility provides some sort of automation, the script could take one file, feed it to the utility against each of the others, and let the utility make the decisions. Once that pass is complete, you would move the first file to a new location and start again with the next file. Well, something like that anyway. I don't know if any utilities provide that level of automation, however.
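The scripted loop described above can be sketched without an external diff tool by using Python's standard `difflib` in its place. This is a swapped-in technique, not one of the utilities named in the thread. The function name `similar_pairs`, the 0.9 threshold, and the sample file names are assumptions for illustration, and the texts are assumed to be already loaded as plain strings.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(texts, threshold=0.9):
    """Return (name_a, name_b, ratio) for every pair at or above threshold.

    texts maps a file name to its already-extracted plain text.
    """
    hits = []
    # Compare each file with every other file exactly once.
    for (name_a, a), (name_b, b) in combinations(sorted(texts.items()), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            hits.append((name_a, name_b, ratio))
    return hits

texts = {
    "a.txt": "hello world this is a test",
    "b.txt": "hello world this is a test!",
    "c.txt": "completely different content xyz",
}
for name_a, name_b, ratio in similar_pairs(texts):
    print(f"{name_a} ~ {name_b}: {ratio:.0%}")
```

The number of comparisons grows with the square of the file count, so for thousands of files this would likely be slow, which is the concern raised in Reply #2.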

Lashiec · « **Reply #4 on:** June 04, 2008, 10:16 AM »

Mmmm, jv16 PowerTools has a function that lets you specify how similar you want the files to be, as a percentage. I don't know if it will work as it should, because the only time I tried it, it failed completely. I guess the most capable duplicate finders should have something like that as well.

Double Killer, for example
