One of the projects I'm working on has a huge number of document files (HTML, RTF, TXT, Word), many of which contain duplicate content: article abstracts, full articles, and archived copies. So there are a lot of duplicates, but the files themselves are not identical. For example, two files may have the same content but different HTML titles, or one file may contain the text that is also present in three others. It's all a big mess.
The usual duplicate finders won't work, since they only match identical files. I'm not sure whether a tool for this even exists: I need something that identifies duplicate snippets by examining the file contents, works out how those snippets map to duplicate files, and tells me which files can safely be deleted.
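To illustrate the kind of snippet-level comparison I mean, here is a rough sketch in Python (only for plain-text files; HTML/RTF/Word would need a text-extraction step first). It compares files by overlapping word shingles and computes a containment score, so it can catch the case where one file's text is fully contained in another. The shingle size `k` and the threshold are arbitrary values I picked, not anything tested:

```python
import re
from pathlib import Path
from itertools import combinations

def shingles(text, k=8):
    # Normalize to lowercase words, then take overlapping k-word shingles.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def containment(a, b):
    # Fraction of a's shingles that also appear in b.
    # 1.0 means a's text is (nearly) fully contained in b.
    if not a:
        return 0.0
    return len(a & b) / len(a)

def report_overlaps(folder, threshold=0.8):
    # Pairwise comparison; fine for a few thousand files, too slow beyond that.
    texts = {p: shingles(p.read_text(errors="ignore"))
             for p in Path(folder).glob("*.txt")}
    for p, q in combinations(texts, 2):
        c_pq = containment(texts[p], texts[q])
        c_qp = containment(texts[q], texts[p])
        if max(c_pq, c_qp) >= threshold:
            print(p.name, q.name, round(c_pq, 2), round(c_qp, 2))
```

Something along these lines would flag both "same content, different title" pairs and "file A is a subset of file B" cases, but I'd much rather use an existing tool than maintain this myself.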
Any suggestions are welcome.