Duplicate document search with similarity/fuzzy logic?

MrCrispy:
One of the projects I'm working on has a huge number of document files (HTML, RTF, TXT, Word), many of which contain duplicate content. They contain things like article abstracts, full content and archived copies. So there are a lot of duplicates, but the files themselves are not identical. For example, two files may have the same content but a different HTML title, or one file may contain the text present in three others. It's all a big mess :(

The usual duplicate finders will not work. I'm not sure if there is any tool that can help me. I need something that will identify duplicate snippets by examining the file contents, then intelligently recognize how they map to duplicate files and which files can be deleted.

Any suggestions are welcome.
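
To make the "one file contains the text of others" case concrete, here is a rough Python sketch of one possible approach (word shingling with a containment score). It is not taken from any tool mentioned in this thread. The function names, the shingle size k=5 and the regex-based tag stripping are all assumptions chosen for illustration, and the tag stripping in particular is crude.

```python
import re

def strip_tags(html):
    """Crude tag removal (assumption: good enough for a sketch;
    a real HTML parser would be more robust)."""
    return re.sub(r"<[^>]+>", " ", html)

def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words becomes a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(a, b, k=5):
    """Fraction of a's shingles that also appear in b (0.0 to 1.0).

    A value near 1.0 suggests that b contains nearly all of a's text,
    so a is a candidate for deletion.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa) if sa else 0.0
```

The score is deliberately asymmetric: containment(abstract, full_article) can be close to 1.0 while containment(full_article, abstract) stays low, which hints at which of the two files is the redundant one. Text from RTF or Word files would have to be extracted separately before it could be scored this way.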

PhilB66:
TextDiff

WinMerge

KDiff3

MrCrispy:
Thanks, but a diff/merge utility won't be enough, because it would need to check thousands of files against each other.

steeladept:
What if you used a script to feed the diff/merge utility? Assuming the utility provides some sort of automation, the script could go through the files and feed each pair to it, and the utility would then make the decisions. Once one file has been compared against all the others, you could move it to a new location and start again with the next file. Well, something like that anyway. I don't know if any utilities provide that level of automation, however.
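
A minimal sketch of that kind of script, using Python's standard difflib module in place of an external diff/merge utility (that substitution is an assumption, not something any of the tools above are known to offer). The function name similar_pairs, the 0.9 threshold and the docs mapping of file name to already-extracted text are hypothetical.

```python
import difflib
import itertools

def similar_pairs(docs, threshold=0.9):
    """Yield (name_a, name_b, ratio) for every pair at or above threshold.

    docs is a hypothetical mapping of file name -> extracted text.
    """
    # combinations() visits each unordered pair exactly once.
    for (na, ta), (nb, tb) in itertools.combinations(sorted(docs.items()), 2):
        # autojunk=False keeps the ratio from being skewed on long texts.
        ratio = difflib.SequenceMatcher(None, ta, tb, autojunk=False).ratio()
        if ratio >= threshold:
            yield na, nb, ratio
```

Note that this compares every file with every other file, so the number of comparisons grows with the square of the file count. With thousands of files that means millions of comparisons, which is the concern raised earlier in the thread, so some cheaper pre-filtering step would probably be needed first.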

Lashiec:
Mmmm, jv16 PowerTools has a function that lets you specify how similar you want the files to be, as a percentage. I don't know if it works as it should, because the only time I tried it, it failed completely. I guess the most capable duplicate finders should have something like that as well.

Double Killer, for example :)
