
Topic: Duplicate documents search with similarity/fuzzy logic ?

MrCrispy

  • Participant
  • Joined in 2006
  • Posts: 332
One of the projects I'm working on has a huge number of document files (HTML, RTF, TXT, Word), many of which contain duplicate content. They hold things like article abstracts, full articles and archived copies, so there is a lot of duplication, but the files themselves are not identical. For example, two files may have the same content but a different HTML title, or one file may contain the text found in three others. It's all a big mess :(

The usual duplicate finders will not work. I'm not sure whether any tool can help me: I need something that identifies duplicate snippets by examining the file contents, then intelligently works out how those snippets map to duplicate files and which files can be deleted.

Any suggestions are welcome.

PhilB66

  • Supporting Member
  • Joined in 2007
  • Posts: 1,522

MrCrispy

  • Participant
  • Joined in 2006
  • Posts: 332
Re: Duplicate documents search with similarity/fuzzy logic ?
« Reply #2 on: June 04, 2008, 12:27 AM »
Thanks, but a diff/merge utility won't be enough, because it would need to check thousands of files against each other.

steeladept

  • Supporting Member
  • Joined in 2006
  • Posts: 1,061
Re: Duplicate documents search with similarity/fuzzy logic ?
« Reply #3 on: June 04, 2008, 12:39 AM »
What if you used a script to feed the diff/merge utility? Assuming the utility offers some sort of automation, the script could step through the files and pass each one to the utility, which would then make the comparison decisions. Once a file has been checked against all the others, you would move it to a new location and start again with the next file. Well, something like that anyway. I don't know whether any utilities provide that level of automation, however.
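
For what it's worth, the "script that feeds a compare tool" idea can be sketched without any external utility. The following is only a rough illustration, not something proposed in the thread: it uses Python's standard difflib module as the comparer, and the function names, the "docs" folder and the 0.8 threshold are all made up for the example. It also assumes the files have already been reduced to plain text. Note that comparing every file against every other means n(n-1)/2 comparisons (499,500 for 1,000 files), and SequenceMatcher is slow on long texts, so this would only be practical for modest collections.

```python
import difflib
import itertools
from pathlib import Path


def load_texts(folder, pattern="*.txt"):
    """Read every matching file under `folder` into a {name: text} dict."""
    return {
        str(path): path.read_text(encoding="utf-8", errors="ignore")
        for path in sorted(Path(folder).rglob(pattern))
    }


def similar_pairs(texts, threshold=0.8):
    """Yield (name_a, name_b, ratio) for each pair scoring at or above threshold."""
    for name_a, name_b in itertools.combinations(texts, 2):
        matcher = difflib.SequenceMatcher(
            None, texts[name_a], texts[name_b], autojunk=False
        )
        ratio = matcher.ratio()  # 0.0 = nothing in common, 1.0 = identical
        if ratio >= threshold:
            yield name_a, name_b, ratio


if __name__ == "__main__":
    # "docs" is a placeholder folder name for this sketch.
    for a, b, ratio in similar_pairs(load_texts("docs")):
        print(f"{ratio:.0%}  {a}  <->  {b}")
```

The script only reports candidate pairs; deciding which copy to keep or move is deliberately left to a person.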

Lashiec

  • Member
  • Joined in 2006
  • Posts: 2,374
Re: Duplicate documents search with similarity/fuzzy logic ?
« Reply #4 on: June 04, 2008, 10:16 AM »
Mmmm, jv16 PowerTools has a function that lets you specify how similar the files should be, as a percentage. I don't know whether it works as it should, because the only time I tried it, it failed completely. I guess the most capable duplicate finders should have something like that as well.

Double Killer, for example :)
« Last Edit: June 04, 2008, 10:20 AM by Lashiec »
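
To make the "similarity percentage" idea concrete, here is one common way such a score can be computed. This is an assumption for illustration only and says nothing about how jv16 PowerTools or Double Killer actually work. Each document is broken into overlapping runs of k words ("shingles"); the Jaccard score measures overall overlap between two documents, and the containment score measures how much of a smaller document appears inside a larger one, which is the "one file contains the text of another" case from the first post. The function names and the choice of k=3 are hypothetical, and text is assumed to have been extracted from the HTML/RTF/Word files beforehand.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in `text` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles in both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(small, big):
    """Fraction of the shingles in `small` that also occur in `big`."""
    return len(small & big) / len(small) if small else 0.0


abstract = shingles("Fuzzy matching finds near duplicate documents quickly")
archived = shingles(
    "Archived copy: fuzzy matching finds near duplicate documents quickly, "
    "and more text follows here"
)
print(f"Jaccard:     {jaccard(abstract, archived):.0%}")
print(f"Containment: {containment(abstract, archived):.0%}")
```

A high containment score with a low Jaccard score suggests the smaller file is an excerpt of the larger one, while a high Jaccard score suggests two near-identical copies.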