EDIT: Sorry, Shades, I had overlooked your first post mentioning ExamDiff Pro; indeed, it's on my list of tools to trial, like the few others mentioned below.
Thank you, Shades, for sharing your experience. In fact, I'm heavily googling for such db dump-and-compare tools, and there are several that bear two price tags: structure only, and structure plus content. So I did think about dumps, but such a dump would then be nothing more than what I already have with my text files and their item = "record" divider code character: those "records" are in total "disorder" from any text compare tool's point of view, so putting my things into a db (which will be a problem in itself), then extracting it dump-wise, will not get me many steps forward!
"Not many steps", instead of "no step", because, in fact, there IS some step in the right direction here: In fact, many text compare tools and programmable editors and such allow for SORTING LINES, but as you will have understood, my items ain't but just single lines, but bulks of lines, so any "sort line, then compare" idea is to be discarded.
And here, a db and its records would indeed help, since in a db, you can sort records, not just sort lines, and that's why...
EUREKA for my specific problem at least:
- I can import my "text dump" (= text file, from my 1,000 "items", divided by a code character into "records") from my file 1, into an askSam file.
- ditto for my file 2.
- I then sort both AS files by first line of their records.
- I then export both AS files into respective text files.
- I can then import those NEW text files into ANY text compare tool (e.g. BC, which I "know"), and then only some 10 or so new or altered items would be shown, not 200 moved ones.
- The same would be possible with any sql db and then its sorted records, dumped into text files, then compared by any tool.
- In any case, this is a lot of fiddling around (and hoping that buggy AS will not mix up things but will work according to its claims); the sort-and-export part could also be scripted, see the sketch below.
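Not a feature of any of those tools, just a minimal sketch of that sort-then-export idea in Python, assuming the "records" are separated by a single divider code character (the "\x1d" below is purely an assumption; use whatever character your files actually use):

# Split two files into records by a divider character, sort the records
# by their first line, and write the sorted versions back out so that
# any ordinary line-based compare tool can take over from there.
DIVIDER = "\x1d"

def sorted_records(path):
    with open(path, encoding="utf-8") as f:
        records = [r for r in f.read().split(DIVIDER) if r.strip()]
    # sort by the first non-empty line of each record
    return sorted(records, key=lambda r: r.strip().splitlines()[0])

for src, dst in [("file1.txt", "file1_sorted.txt"),
                 ("file2.txt", "file2_sorted.txt")]:
    with open(dst, "w", encoding="utf-8") as out:
        out.write(DIVIDER.join(sorted_records(src)))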
Back to the original problem:
My googling for "recognize moved text blocks" gave these hits, among others:
http://www.huffingto...and-s_b_2754804.html = "6 Ways to Recognize And Stop Dating A Narcissist" (= hit number 43 or 44)
Now hold your breath: This got 1,250 comments! (= makes quite a difference from dull comp things and most threads here...)
Very funny article, by the way: laughed a lot!
http://www.velocityr...locks-detection.html
So we are into Damerau-Levenshtein country here (not that I understand a mumbling word of this...), of "time penalties", and of "edit distance"
(http://en.wikipedia....g/wiki/Edit_distance)
(Perhaps, if I search long enough for "edit distance", I'll get back to another Sandy Weiner article? "Marriages come and go, but divorce is forever", she claims... well, there are exceptions to every rule, RIP Taylor/Burton as just ONE example here)
Also of interest in this context, "Is there an algorithm that can match approximate blocks of text?" (well, this would be another problem yet):
http://digitalhumani...imate-blocks-of-text
And also, this Shapira/Storer paper, "Edit Distance with Move Operations":
http://www.cs.brande...cations/JDAmoves.pdf
(A very complicated subject which I don't understand, but which my, call it "primitive", way definitely sidesteps: equal-block elimination (!) first, THEN any remaining comparison.)
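For anybody as lost as I am with these terms: plain "edit distance" (Levenshtein) just counts the minimum number of single-character insertions, deletions and substitutions to turn one string into another; Damerau-Levenshtein adds transpositions, and the Shapira/Storer paper adds whole-block moves on top. A tiny sketch of the plain version, only to show what is being counted:

# Plain Levenshtein edit distance between two strings: the minimum number
# of single-character insertions, deletions and substitutions.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # prints 3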
Now, not being a programmer (and thus, not knowing "any better"), I would attack the problem like this, for whenever the setting for this speciality is "on":
- have a block separator code (= have the user enter it, then the algorithm knows what to search for)
- look for this code within file 1, and build a list of the blocks and their line ranges (= block 1 = lines 1 to 23, block 2 = lines 24 to 58, etc.)
for any block "a" in file 1 (see the list built up before):
- apply the regular / main (line-based) algorithm, FOR block "a" in file 1, TO any of the blocks 2 to xyz in file 2
- mark any of the "hit" blocks in file 2 as "do not show and process any further"
- this also means, no more processing of these file 2 blocks for further comparisons with further file 1 blocks
end for
- from file 2, show and process only those lines that have NOT already been discarded by the above
(for other non-programmers: all this is easily obtained by first duplicating the original file 2, then just deleting the "hit" parts of that duplicate, progressively)
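Not claiming this is how BC or anyone else should do it internally, just a sketch of the above in Python, again assuming a single divider code character ("\x1d" is an assumption) and exact, whole-block matches:

# "Equal-block elimination": every block of file 1 that occurs verbatim in
# file 2 is removed from both sides (each match consumed only once); what
# remains is written out and handed to the regular line-based comparison.
DIVIDER = "\x1d"

def blocks(path):
    with open(path, encoding="utf-8") as f:
        return [b for b in f.read().split(DIVIDER) if b.strip()]

blocks1 = blocks("file1.txt")
remaining2 = blocks("file2.txt")     # the "duplicate of file 2"
remaining1 = []

for a in blocks1:
    if a in remaining2:
        remaining2.remove(a)         # "hit": drop it, never reuse it
    else:
        remaining1.append(a)         # survives for the real comparison

with open("file1_rest.txt", "w", encoding="utf-8") as f:
    f.write(DIVIDER.join(remaining1))
with open("file2_rest.txt", "w", encoding="utf-8") as f:
    f.write(DIVIDER.join(remaining2))
# file1_rest.txt / file2_rest.txt now contain only the genuinely new,
# deleted or altered blocks; feed those to BC, WinMerge or whatever.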
So where, good heavens, are the difficulties in such an option / sub-routine, which otherwise fine progs like BC (where users have been begging for it since 2005, and where the developers always say "it's on our wish list" - what?! a "wish list", not for users, but for the developers? I'd always thought their wish list was called a "road map"! or who's supposed to do the work for these developers, then?!) or Araxis Merge (about 300 bucks, mind you!) seem unable to realize?
Do I overlook important details here, and would all this really have to be done via Damerau-Levenshtein et al. instead?
Good heavens, even I could do this, in AHK, triggering the basic algorithm, i.e. the "compare tool part", again and again, for each text block, from within my replicate-then-delete-all-identical-parts-consecutively script! (Of course, in such a hybrid thing there would be a lot of time-consuming shuffling around, since the "compare tool part", triggered again and again, would have to be fed with just the separate text BITS from both files each time.)
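Not AHK, and "fc" below is only a stand-in for whatever command line your own compare tool offers, but here's a sketch of what that hybrid shuffling would look like: write each pair of blocks to temporary files and fire the external tool at them, block pair by block pair:

# Hybrid approach: the external compare tool is invoked once per block
# pair, on two temporary files holding just those text bits.
import os
import subprocess
import tempfile

def blocks_are_identical(block_a, block_b):
    paths = []
    for text in (block_a, block_b):
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    try:
        # "fc" (Windows file compare) returns 0 when the files are equal;
        # substitute the CLI of the compare tool you actually use.
        return subprocess.call(["fc", *paths]) == 0
    finally:
        for p in paths:
            os.remove(p)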
At this moment, I can only trial FREE tools, since I don't have time to do a complete image of my system (and I want to stay able to trial these tools later on, too), but it seems most tools do NOT do it (correctly): e.g., I just trialled WinMerge, which is said to do it, with the appropriate setting ("Edit - Options - Compare - General - Enable moved block detection"), and with that setting both on and off, it does NOT do it; and btw, it does not ask for the necessary block divider code character either...
Other possible candidates: XDiff, XMerge, some parts within Tortoise (complicated system), ExamDiff Pro, and finally Diffuse, which is said to do it - for any of these, and CompareIt! of course, the question remains whether they can handle just moved LINES, or really BLOCKS, and without a given, predetermined block identifier character, I think results will be as aleatoric as they are in that so-much-praised WinMerge.
At the end of the day, it could be that there is NOT A SINGLE tool that does it correctly, and it's obvious many people over-hype (if such a thing is possible) free sw just because it's free (WinMerge, e.g. has navigation to first, prev and last difference, but lacks navigation to next difference, as incredible as that might look...).
EDIT: Well, I should add two things that go without saying, but for the sake of a "complete concept": first, the algorithm "doing" all these blocks can of course be much simpler than the regular algorithm that afterwards does the remaining comparison (= time-saving); and second, we're only into "total identity or not" here, so this sub-routine can speed on to the next "record" as soon as it detects the tiniest dissimilarity. Both simplifications are NOT possible if I have to use a "regular", "standard" compare tool within a macro for this part of the script.
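Just to make those two simplifications concrete (again only a sketch, refining the elimination pass above, nothing any existing tool is claimed to do): since only "totally identical or not" matters for the block pass, each file-2 block can be reduced to one cheap hash, and every file-1 block then costs a single lookup instead of a full comparison against every remaining file-2 block:

# The "total identity or not" shortcut: hash each block once; matching then
# needs no line-by-line work at all, and mismatching blocks are simply left
# for the regular, slower comparison afterwards.
import hashlib
from collections import Counter

def fingerprint(block):
    return hashlib.sha1(block.encode("utf-8")).hexdigest()

def eliminate_identical(blocks1, blocks2):
    pool = Counter(fingerprint(b) for b in blocks2)   # unmatched file-2 blocks
    rest1 = []
    for b in blocks1:
        h = fingerprint(b)
        if pool[h]:
            pool[h] -= 1              # identical twin found: consume it
        else:
            rest1.append(b)           # keep for the regular comparison
    consumed = Counter(fingerprint(b) for b in blocks2) - pool
    rest2 = []
    for b in blocks2:
        h = fingerprint(b)
        if consumed[h]:
            consumed[h] -= 1          # this copy was matched: drop it
        else:
            rest2.append(b)
    return rest1, rest2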