DonationCoder.com Forum

Main Area and Open Discussion => General Software Discussion => Topic started by: David.P on August 08, 2016, 09:01 AM

Title: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on August 08, 2016, 09:01 AM
Hi forum,

I often use text comparison tools such as WinMerge to compare two versions of a document, for example from different sources.

The problem is that the documents often originate from (different) OCR processes and are only slightly different mainly due to the OCR errors. Therefore, even if the contents of two original documents should be identical, WinMerge often does not recognize this and therefore marks the entire document (erroneously) as completely different.

Therefore my question is whether anyone knows of a text comparison tool that has an even more robust block similarity detection* than, for example, WinMerge has.

Thanks for hints
David.P
__
*) Examples of similarity/moved block detection:
Meld:
(http://i.imgur.com/hplxCiD.png)
(also not good with OCR'ed files)

XDiff:
(http://i.imgur.com/C5l7auY.png)
(could not find a non-commercial (https://www.plasticscm.com/features/xmerge.html) version for Windows)
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: Curt on August 08, 2016, 10:55 AM
WinMerge can do what you want. Just set the line-difference highlighting to character level:

[ You are not allowed to view attachments ]
-Superuser 4 yrs ago

Other than that:

16 Alternatives to WinMerge:
https://www.topbestalternatives.com/2015/top-5-best-alternatives-to-winmerge/

Free or Open Source WinMerge Alternatives for Windows - AlternativeTo.net:
https://alternativeto.net/software/winmerge/?license=free&platform=windows
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: Shades on August 09, 2016, 11:35 AM
'ExamDiff Pro' and 'Beyond Compare' are commercial diff tools that do, for my intents and purposes, a very good job on Windows.
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on August 11, 2016, 10:43 AM
Thanks for the responses.

I have already tried lots of diff tools, but none that I am aware of had a particularly robust moved block detection (when there are only a few small differences within otherwise identical blocks).

So I would still be grateful for first-hand experience specifically with moved block detection.
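
To make the requirement concrete, the kind of fuzzy moved-block matching asked about here can be roughly sketched with Python's standard `difflib` module. This is only an illustration of the idea, not how WinMerge, Meld or any other tool in this thread works. The function name `match_blocks`, the splitting into paragraphs on blank lines, and the 0.8 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def match_blocks(old_text, new_text, threshold=0.8):
    """Pair each paragraph of old_text with its most similar paragraph in new_text.

    Returns a list of (old_index, new_index, ratio) tuples. new_index is None
    when no paragraph in new_text reaches the (assumed) similarity threshold.
    """
    # Assumption: a "block" is a paragraph separated by a blank line.
    old_blocks = [b.strip() for b in old_text.split("\n\n") if b.strip()]
    new_blocks = [b.strip() for b in new_text.split("\n\n") if b.strip()]

    matches = []
    for i, old in enumerate(old_blocks):
        best_j, best_ratio = None, 0.0
        for j, new in enumerate(new_blocks):
            # ratio() is 1.0 for identical strings and drops with each difference.
            ratio = SequenceMatcher(None, old, new).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_ratio >= threshold:
            matches.append((i, best_j, best_ratio))
        else:
            matches.append((i, None, best_ratio))
    return matches


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog.\n\nSecond paragraph stays here."
    b = "Second paragraph stays here.\n\nThe qu1ck brown fox jumps ovcr the lazy dog."
    for old_i, new_i, ratio in match_blocks(a, b):
        moved = new_i is not None and new_i != old_i
        print(old_i, new_i, round(ratio, 2), "moved" if moved else "")
```

A block whose best match sits at a different index counts as moved, and small OCR-style errors only lower the ratio instead of breaking the match. Comparing every block against every other block is quadratic, so this would be slow on long documents.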
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on May 12, 2017, 11:59 AM
For completeness, lately I've been playing around with the online diff tools at wikEd (http://cacycle.altervista.org/wikEd-diff-tool.html) and at Vroniplag (http://de.vroniplag.wikia.com/wiki/Quelle:Textvergleich) -- and must say that they are both pretty amazing with regard to moved blocks detection. Moved blocks are even detected when they additionally have been edited in one or both copies of the text.

Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: irkregent on May 13, 2017, 04:03 PM
I just ran across another diff tool that you might want to try:

True Human Design - Diffinity Official Homepage
http://truehumandesign.se/s_diffinity.php

Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on February 09, 2019, 08:00 AM
Hi all,

I just wanted to report that I have tested some online and off-line (*.exe) diff tools again. In these tests, the installed tools, like the latest versions of WinMerge and Meld, proved to have no chance against the best online tools. Both WinMerge and Meld fail completely as soon as the texts become just a little dissimilar.

I checked like a dozen online comparison tools, and the best ones (by far) still seem to be Vroniplag (http://de.vroniplag.wikia.com/wiki/Quelle:Textvergleich) (great multi-color highlighting of identical textblocks including moved block detection), and wikEd diff (http://cacycle.altervista.org/wikEd-diff-tool.html) (highly configurable, including moved block detection). Vroniplag even lets you copy the results including all the color highlighting into Word, however only if the comparison is done with Internet Explorer. Both with Chrome and Firefox, this does not seem to work properly.

Copyleaks.com (https://copyleaks.com/compare) seems to be even more sophisticated, since it at least tries to detect similarity even when there are small differences such as OCR errors. Also, you can click on a text passage and it will highlight the corresponding passage in the other text. However, it a) is terribly slow compared to the other two mentioned above, b) has no multi-color coding for a quick overview, c) breaks longer texts into multiple pages that you have to click through, and d) does not let you quickly edit one or both of your texts and redo the comparison with the edited texts.

I have yet to discover a service or tool that uses some kind of fuzzy logic or artificial intelligence to spot and visibly highlight similarities in a "stepless" manner, for example by using some sort of heat map color highlighting of the similarities.
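
As a rough idea of what such "stepless" highlighting could look like, here is a minimal Python sketch that shades each line of one text by how closely it matches any line of the other text. It is not taken from any of the tools mentioned; the function names, the red-to-green HSL scale and the line-by-line granularity are all assumptions, and the similarity score is simply `difflib`'s ratio rather than anything resembling AI.

```python
from difflib import SequenceMatcher
from html import escape


def heat_color(score):
    """Map a similarity score in [0, 1] to a CSS colour from red (0) to green (1)."""
    score = max(0.0, min(1.0, score))
    hue = int(round(120 * score))  # 0 = red, 120 = green on the HSL colour wheel
    return f"hsl({hue}, 80%, 80%)"


def best_similarity(block, candidates):
    """Highest difflib ratio between block and any candidate (0.0 if there are none)."""
    return max(
        (SequenceMatcher(None, block, c).ratio() for c in candidates),
        default=0.0,
    )


def heat_map_html(text, other_text):
    """Render each non-empty line of text as a <div> shaded by its best match in other_text."""
    other_lines = [ln for ln in other_text.splitlines() if ln.strip()]
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        score = best_similarity(line, other_lines)
        rows.append(
            f'<div style="background:{heat_color(score)}" '
            f'title="{score:.2f}">{escape(line)}</div>'
        )
    return "\n".join(rows)
```

Writing the returned string into an .html file and opening it in a browser gives a crude heat map: lines with an exact counterpart come out green, lines without any counterpart drift towards red, and OCR-damaged lines land somewhere in between.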

Cheers
David
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: Shades on February 10, 2019, 01:24 AM
ExamDiff Pro and BCompare are good off-line comparison tools. They can also keep track of changes in different folders, and both come with a decent built-in text editor. The latest incarnations of these software packages should have no problem with applying fuzzy logic to the content being compared.

Or did I misunderstand and you only consider on-line diffing solutions?
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on February 10, 2019, 03:27 AM
Thanks, I see now that at least ExamDiff Pro says that it has some sort of fuzzy comparison.

In the meantime I found this (https://people.f4.htw-berlin.de/~weberwu/simtexter/app.html), from the makers of Vroniplag. IMHO this tool blows everything else out of the water, at least when you're looking for text comparison with lots of moved blocks.

I will try ExamDiff Pro and BeyondCompare and report back.

Additionally, I am now looking for a tool that can find and highlight repeated blocks of text within a single file. So far, the only solution I found is Textanz (http://www.textanz.com/index.php):

[ You are not allowed to view attachments ]

Textanz, however, highlights repeated blocks in only a single color, which makes them very tedious to follow when there are lots of (possibly nested) repetitions.
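
For illustration, finding repeated passages inside one text and giving each its own color can be sketched in a few lines of Python. This is a simplified stand-in, not how Textanz works: it only finds exact repeats of fixed-length word runs, and the names `repeated_word_runs`, `colors_for_repeats` and `PALETTE` as well as the default run length of 5 words are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical palette; colours are reused once there are more repeats than entries.
PALETTE = ["yellow", "lightgreen", "lightblue", "pink", "orange"]


def repeated_word_runs(text, n=5):
    """Return {phrase: [start word indices]} for every n-word run occurring more than once."""
    words = text.split()
    positions = defaultdict(list)
    for i in range(len(words) - n + 1):
        positions[" ".join(words[i:i + n])].append(i)
    return {phrase: idx for phrase, idx in positions.items() if len(idx) > 1}


def colors_for_repeats(repeats):
    """Assign each repeated phrase a colour from PALETTE, cycling when it runs out."""
    return {
        phrase: PALETTE[k % len(PALETTE)]
        for k, phrase in enumerate(sorted(repeats))
    }


if __name__ == "__main__":
    sample = "one two three four five six one two three four five seven"
    repeats = repeated_word_runs(sample, n=5)
    for phrase, color in colors_for_repeats(repeats).items():
        print(color, repeats[phrase], phrase)
```

Overlapping runs are reported separately, so a long repeated passage shows up as several overlapping 5-word hits. Merging those into maximal blocks and handling nested repetitions would need more work than this sketch does.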
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on February 10, 2019, 05:51 AM
Aaaarggghhhh, this is 2019 and BeyondCompare still can't wrap text to the width of the respective window  :down: :down: :down: :down: :down: :down: :down: :mad: :mad: :mad: :mad: :mad: :mad:

This unfortunately makes BeyondCompare completely UNUSABLE for text analysis (other than for program code, possibly).

The same goes for ExamDiff Pro. While ExamDiff Pro can (sort of) wrap text, the output is not practical for text analysis.

Summing up my experience: while practically all off-line tools like the ones mentioned above might be well suited for comparing program code versions, they are completely and utterly unusable for text analysis.

This is what the comparison output of the typical off-line program looks like:
[ You are not allowed to view attachments ]

Below is what the comparison output of Similarity Texter looks like, and this is what I'm after:
[ You are not allowed to view attachments ]
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: NigelH on February 10, 2019, 07:09 AM
I ran into this exact problem last week with Beyond Compare Pro when looking at changes in a really long make-generated command line.
The "solution" is to use the text compare report feature - the lines wrap quite nicely.


With regard to moved blocks of code, it still annoys the heck out of me that BC battles to correctly detect inserted blocks of code that are similar to preceding code.
Quite often, it shows the terminating brace (often more) of the unchanged C code as part of the change set, with the same new code showing up as part of the unchanged code.
I see this happening almost every week, and no matter how much I futz around with the alignment options, I can't get it to recognize just the piece of new code for what it is.
Drives me nuts, especially as DeltaWalker and KDiff3 have no problem showing the correct change.
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on February 10, 2019, 07:24 AM
I can feel your pain.

This also looks interesting, but unfortunately is not yet able to actually show/highlight the fuzzy text matches that it detects in the background:
https://www.tools4noobs.com/online_tools/string_similarity/
Title: Re: Text comparison tool with most robust similarity/moved block detection
Post by: David.P on February 15, 2019, 01:14 PM
Please read posts #9 and #10.

Thank you.

Edit:
added another link
https://stackoverflow.com/questions/572237/whats-the-best-three-way-merge-tool