Author Topic: Text comparison tool with most robust similarity/moved block detection  (Read 14638 times)

David.P

  • Supporting Member
  • Joined in 2008
  • Posts: 208
  • Ergonomics Junkie
Hi forum,

I often use text comparison tools such as WinMerge to compare two versions of a document, for example versions from different sources.

The problem is that the documents often originate from (different) OCR processes and differ only slightly, mainly because of OCR errors. Even when the contents of the two original documents should be identical, WinMerge often does not recognize this and (erroneously) marks the entire document as completely different.

My question is therefore whether anyone knows of a text comparison tool with even more robust block similarity detection* than, for example, WinMerge has.

Thanks for any hints
David.P
__
*) Examples of similarity/moved block detection:
Meld:

(also not good with OCR'ed files)

XDiff:

(could not find a non-commercial version for Windows)
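
For anyone who wants to experiment with the underlying idea outside a GUI tool: below is a minimal sketch of fuzzy line matching, using Python's standard `difflib` module. It only illustrates how two OCR variants of the same sentence can be scored as "nearly identical" instead of "different"; it is not how WinMerge, Meld or XDiff work internally. The 0.8 threshold and the sample sentences are made-up assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters, which can distort results on long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two lines as 'the same' if their ratio reaches the threshold.

    The default threshold is an arbitrary illustrative value, not a recommendation.
    """
    return similarity(a, b) >= threshold


# Hypothetical sample data: one clean line and one with typical OCR errors.
clean = "The quick brown fox jumps over the lazy dog."
ocr = "The qu1ck brovvn fox jumps ovcr the lazy d0g."
unrelated = "Completely unrelated sentence about diff tools."

print(is_fuzzy_match(clean, ocr))
print(is_fuzzy_match(clean, unrelated))
```

A stricter or looser threshold changes how many OCR errors per line are tolerated, so it would need tuning on real documents.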

Curt

  • Supporting Member
  • Joined in 2006
  • Posts: 7,566
WinMerge can do what you want. Just set the line-difference highlighting to character level:

[Screenshot: WinMerge line-difference highlighting option; source: Super User]

Other than that:

16 Alternatives to WinMerge:
https://www.topbesta...natives-to-winmerge/

Free or Open Source WinMerge Alternatives for Windows - AlternativeTo.net:
https://alternativet...amp;platform=windows
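
To illustrate what character-level highlighting (as opposed to line-level) means, here is a small sketch using Python's standard `difflib`. The `[-...-]` / `{+...+}` markers and the function name `char_diff` are assumptions chosen for this example; this is not WinMerge's implementation.

```python
from difflib import SequenceMatcher


def char_diff(a: str, b: str) -> str:
    """Mark character-level changes between a and b.

    Deleted characters are wrapped as [-x-], inserted ones as {+x+}.
    """
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(a[i1:i2])
        else:
            # 'replace' has both sides, 'delete' only a, 'insert' only b.
            if i1 != i2:
                out.append("[-" + a[i1:i2] + "-]")
            if j1 != j2:
                out.append("{+" + b[j1:j2] + "+}")
    return "".join(out)


print(char_diff("recognize", "rec0gnize"))  # rec[-o-]{+0+}gnize
```

With this kind of output a single OCR error shows up as one changed character rather than as a whole changed line.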

Shades

  • Member
  • Joined in 2006
  • Posts: 2,939
'ExamDiff Pro' and 'Beyond Compare' are commercial diff tools that, for my intents and purposes, do a very good job on Windows.

David.P

Thanks for the responses.

I have already tried lots of diff tools, but none that I am aware of has particularly robust moved block detection (when there are only a few small differences within otherwise identical blocks).

So I would still be grateful for first-hand experience specifically with moved block detection.

David.P

For completeness: lately I've been playing around with the online diff tools at wikEd and at Vroniplag, and I must say that both are pretty amazing with regard to moved block detection. Moved blocks are detected even when they have additionally been edited in one or both copies of the text.
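
One simple way to approximate "moved and slightly edited block" detection is to ignore position entirely and pair each paragraph with its most similar counterpart. The sketch below does that with Python's standard `difflib`; it is an assumption-laden illustration, not the algorithm used by wikEd diff or Vroniplag. Splitting on blank lines, the 0.7 threshold and the function names are all choices made for this example.

```python
from difflib import SequenceMatcher


def split_blocks(text: str) -> list[str]:
    """Split text into paragraphs separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def match_blocks(old: str, new: str, threshold: float = 0.7) -> list[tuple[int, int, float]]:
    """Pair each block of `old` with its most similar block in `new`.

    Returns (old_index, new_index, ratio) tuples. Position is ignored,
    so a block that moved (old_index != new_index) still finds its partner.
    """
    old_blocks, new_blocks = split_blocks(old), split_blocks(new)
    pairs = []
    for i, a in enumerate(old_blocks):
        best_j, best_ratio = -1, 0.0
        for j, b in enumerate(new_blocks):
            r = SequenceMatcher(None, a, b, autojunk=False).ratio()
            if r > best_ratio:
                best_j, best_ratio = j, r
        if best_ratio >= threshold:
            pairs.append((i, best_j, round(best_ratio, 2)))
    return pairs


# Hypothetical sample: the two paragraphs swap places and one is edited.
old_text = "Alpha block about apples and oranges.\n\nBeta block about diff tools and OCR."
new_text = "Beta block about diff tools and 0CR errors.\n\nAlpha block about apples and oranges."

for old_i, new_i, ratio in match_blocks(old_text, new_text):
    status = "moved" if old_i != new_i else "in place"
    print(old_i, "->", new_i, status, ratio)
```

Comparing every block with every other block is quadratic, so this is only practical for short documents; real tools presumably use something smarter.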


irkregent

  • Participant
  • Joined in 2009
  • Posts: 27
I just ran across another diff tool that you might want to try:

True Human Design - Diffinity Official Homepage
http://truehumandesign.se/s_diffinity.php


David.P

Hi all,

I was just going to report that I tested some online and off-line (*.exe) diff tools again. The installed tools, such as the latest versions of WinMerge and Meld, proved to have no chance against the best online tools. Both WinMerge and Meld fail completely as soon as the texts become just a little dissimilar.

I checked about a dozen online comparison tools, and the best ones (by far) still seem to be Vroniplag (great multi-color highlighting of identical text blocks, including moved block detection) and wikEd diff (highly configurable, including moved block detection). Vroniplag even lets you copy the results into Word including all the color highlighting, but only if the comparison is run in Internet Explorer; in Chrome and Firefox this does not seem to work properly.

Copyleaks.com seems to be even more sophisticated, since it at least tries to detect similarity even when there are small differences such as OCR errors. Also, you can click on a text passage and it will highlight the corresponding passage in the other text. However, it a) is terribly slow compared to the other two mentioned above, b) has no multi-color coding for a quick overview, c) breaks longer texts into multiple pages that you have to click through, and d) does not let you quickly edit one or both of your texts and redo the comparison with the edited texts.

I have yet to discover a service or tool that uses some kind of fuzzy logic or artificial intelligence to spot and visibly highlight similarities in a "stepless" manner, for example with some sort of heat map color highlighting of the similarities.
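
As a rough idea of what such "stepless" heat map highlighting could look like, here is a small Python sketch that scores every line of one text against its best match in the other text and maps the score to a background colour. Everything here (the function names, the white-to-red colour scale, HTML as the output format) is an assumption for illustration; it does not describe any existing tool.

```python
import html
from difflib import SequenceMatcher


def best_ratio(line: str, candidates: list[str]) -> float:
    """Highest similarity (0..1) between `line` and any candidate line."""
    return max(
        (SequenceMatcher(None, line, c, autojunk=False).ratio() for c in candidates),
        default=0.0,
    )


def heat_color(ratio: float) -> str:
    """Map 0..1 to a CSS colour: white for no match, full red for identical."""
    level = int(round(255 * (1.0 - ratio)))
    return f"rgb(255,{level},{level})"


def heat_map_html(text_a: str, text_b: str) -> str:
    """Render the lines of text_a, shaded by how closely each one appears in text_b."""
    lines_b = [line for line in text_b.splitlines() if line.strip()]
    rows = []
    for line in text_a.splitlines():
        if not line.strip():
            continue
        color = heat_color(best_ratio(line, lines_b))
        rows.append(f'<div style="background:{color}">{html.escape(line)}</div>')
    return "\n".join(rows)


print(heat_map_html("An identical line\nA slightly d1fferent line",
                    "A slightly different line\nAn identical line"))
```

Saving the returned string to an .html file and opening it in a browser shows the shading; finer granularity (per sentence or per word window) would follow the same pattern.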

Cheers
David
« Last Edit: February 09, 2019, 01:48 PM by David.P, Reason: typo »

Shades

ExamDiff Pro and BCompare (Beyond Compare) are good off-line comparison tools. They can also keep track of changes in different folders, and both come with a decent built-in text editor. The latest incarnations of these software packages should have no problem applying fuzzy logic to the content being compared.

Or did I misunderstand and you only consider on-line diffing solutions?

David.P

Thanks, I see now that at least ExamDiff Pro says that it has some sort of fuzzy comparison.

In the meantime I found this, from the makers of Vroniplag. IMHO this tool blows everything else out of the water, at least when you're looking for text comparison with lots of moved blocks.

I will try ExamDiff Pro and BeyondCompare and report back.

Additionally, I am now looking for a tool that can find and highlight repeated blocks of text within a single file. So far, the only solution I found is Textanz:

[Screenshot: Textanz highlighting repeated blocks of text]

Textanz, however, highlights repeated blocks in only a single color, which makes it very tedious to follow when there are lots of (possibly nested) repetitions.
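
For finding repeated blocks within a single file, a common starting point is to index word n-grams ("shingles") and report those that occur more than once. The Python sketch below shows that idea; it is not how Textanz works, it only finds exact repetitions (no fuzzy matching), and the function name and the n-gram length are assumptions for this example.

```python
from collections import defaultdict


def repeated_ngrams(text: str, n: int = 5) -> dict[str, list[int]]:
    """Return word n-grams that occur more than once.

    Keys are the n-grams (lower-cased, single-spaced); values are the
    word offsets at which each occurrence starts.
    """
    words = text.lower().split()
    positions: dict[str, list[int]] = defaultdict(list)
    for i in range(len(words) - n + 1):
        positions[" ".join(words[i:i + n])].append(i)
    return {gram: pos for gram, pos in positions.items() if len(pos) > 1}


# Hypothetical sample text with one repeated passage.
sample = "the cat sat on the mat and then the cat sat on the mat again"

for gram, offsets in repeated_ngrams(sample, n=4).items():
    print(offsets, gram)
```

Since every repeated n-gram has its own key, giving each key a distinct colour would be a straightforward way to get the multi-color highlighting that a single-color display lacks; overlapping n-grams would still need to be merged into longer blocks.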
« Last Edit: February 10, 2019, 04:23 AM by David.P »

David.P

Aaaarggghhhh, this is 2019 and BeyondCompare still can't wrap text to the width of the respective window  :down: :down: :down: :down: :down: :down: :down: :mad: :mad: :mad: :mad: :mad: :mad:

This unfortunately makes BeyondCompare completely UNUSABLE for text analysis (other than for program code, possibly).

The same largely applies to ExamDiff Pro. While ExamDiff Pro can (sort of) wrap text, the output is not practical for text analysis.

Summing up my experience: while practically all off-line tools like the ones mentioned above may be well suited to comparing versions of program code, they are completely and utterly unusable for text analysis.

This is what the comparison output of a typical off-line program looks like:
[Screenshot: comparison output of a typical off-line diff program]

Below is what the comparison output of Similarity Texter looks like, and this is what I'm after:
[Screenshot: comparison output of Similarity Texter]
« Last Edit: February 10, 2019, 06:21 AM by David.P »

NigelH

  • Charter Member
  • Joined in 2005
  • Posts: 210
« Reply #10 on: February 10, 2019, 07:09 AM »
I ran into this exact problem last week with Beyond Compare Pro when looking at changes in a really long make generated command line.
The "solution" is to use the text compare report feature - the lines wrap quite nicely.


With regard to moved blocks of code, it still annoys the heck out of me that BC struggles to correctly detect inserted blocks of code that are similar to the preceding code.
Quite often, it shows the terminating brace (often more) of the unchanged C code as part of the change set, with the same new code showing up as part of the unchanged code.
I see this happening almost every week, and no matter how much I futz around with the alignment options, I can't get it to recognize just the piece of new code for what it is.
Drives me nuts, especially as DeltaWalker and KDiff3 have no problem showing the correct change.

David.P

« Reply #11 on: February 10, 2019, 07:24 AM »
I can feel your pain.

This also looks interesting, but unfortunately is not yet able to actually show/highlight the fuzzy text matches that it detects in the background:
https://www.tools4noobs.com/online_tools/string_similarity/

David.P

« Reply #12 on: February 15, 2019, 01:14 PM »
Please read posts #9 and #10.

Thank you.

Edit:
added another link
https://stackoverflo...three-way-merge-tool