
The problem with text compare tools - and, similarly, with database compare tools


Shades:
In most cases, databases are compared because a DBA suspects a change in structure, which can be very problematic for software written especially for that particular database structure. Although one should not program like this, I think it can result in faster applications when one already knows exactly where to look/write.

For Oracle databases there are tools that can also do content comparison; some are basic and free, but in most cases expect to pay dearly (especially with Oracle products). With MySQL it is usually not that difficult to make a complete extract (structure and content) into a .sql file, which can then easily be checked for differences with a text comparison tool.
I use XAMPP, which comes with MySQL, a web server and the phpMyAdmin scripts. With that it is easy to create a complete dump of any MySQL database.
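For illustration, a minimal sketch of that dump-and-diff idea in Python; it assumes mysqldump is on the PATH and that credentials come from an option file, and the database names db_before / db_after are just placeholders:

--- Code: ---
# Sketch: dump two MySQL databases and diff the dumps as plain text.
# Assumptions: mysqldump is on the PATH, credentials are configured in
# an option file (e.g. ~/.my.cnf); db_before / db_after are placeholders.
import difflib
import subprocess

def dump(db_name):
    # --skip-extended-insert writes one INSERT per row, which keeps the
    # dump line-oriented and friendly to line-based compare tools;
    # --skip-comments drops the timestamp comments that always differ.
    result = subprocess.run(
        ["mysqldump", "--skip-extended-insert", "--skip-comments", db_name],
        capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

diff = difflib.unified_diff(dump("db_before"), dump("db_after"),
                            fromfile="db_before", tofile="db_after",
                            lineterm="")
print("\n".join(diff))
--- End code ---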

I have no personal experience with other databases (only some minor PostgreSQL), such as the Microsoft and IBM database products, but I suspect that tools for these databases are more similar in functionality and price to the tools for Oracle products than to those for MySQL.

A long(er) shot would be to use the 'Oracle SQL Developer' software for both types of comparison. You get this software for free when installing the latest Oracle client and/or database software (you can also fish it out of the installation archive and run it as a portable application; all the necessary Java software is included, so you do not need to install Java, you just have to configure it once). It is not included in the Oracle XE versions, last time I checked.

If memory serves me right, this software is designed for Oracle databases (duh!), but it is also able to handle Microsoft Access, and as MySQL is now part of the Oracle family, it should be able to work with those databases too.

Granted, the above solution doesn't sound simple and easy, but at least it is free. If you are interested, you could PM me an FTP/Dropbox location where I can dump an archive of the Oracle SQL Developer software I use on my system.

evamaria:
EDIT: Sorry, Shades, I had overlooked your first post mentioning ExamDiff Pro; indeed, it's on my list to trial, as are the very few others mentioned below.


Thank you, Shades, for sharing your experience. In fact, I'm heavily googling for such db dump-and-compare tools, and there are several that bear two price tags: structure only, and structure plus content. So I had indeed thought about dumps, but such a dump would then be nothing more than what I already have with my text files and their item = "record" divider code character: those "records" are in total "disorder" from any text compare tool's point of view, so putting my material into a db (which would be a problem in itself) and then extracting it, dump-wise, would not get me many steps forward!

"Not many steps", instead of "no step", because, in fact, there IS some step in the right direction here: In fact, many text compare tools and programmable editors and such allow for SORTING LINES, but as you will have understood, my items ain't but just single lines, but bulks of lines, so any "sort line, then compare" idea is to be discarded.

And here, a db and its records would indeed help, since in a db, you can sort records, not just sort lines, and that's why...

EUREKA, for my specific problem at least:

- I can import my "text dump" (= the text file of my 1,000 "items", divided by a code character into "records") from my file 1 into an askSam file.
- ditto for my file 2.

- I then sort both AS files by first line of their records.

- I then export both AS files into respective text files.

- I can then import those NEW text files into ANY text compare tool (e.g. BC, which I "know"), and then only some 10 or so new or altered items would be shown, not 200 moved ones.

- The same would be possible with any SQL db and then its sorted records, dumped into text files, then compared by any tool (see the sketch after this list).

- In any case, this is a lot of fiddling around (and hoping that buggy AS will not mix things up but will work according to its claims).
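For other non-programmers, here is what the sort-then-export step could look like without askSam, as a minimal Python sketch; the form-feed record separator "\f" and the file names are assumptions, to be replaced by whatever divider code character the dump actually uses:

--- Code: ---
# Sketch: sort the "records" of a text dump by their first line, so that
# moved items line up before the files go into a line-based compare tool.
# RECORD_SEP is an assumption: substitute the real divider code character.
RECORD_SEP = "\f"

def sort_records(in_path, out_path):
    with open(in_path, encoding="utf-8") as f:
        records = f.read().split(RECORD_SEP)
    # Sort multi-line records by their first non-empty line.
    records.sort(key=lambda r: r.lstrip().splitlines()[0] if r.strip() else "")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(RECORD_SEP.join(records))

sort_records("file1.txt", "file1_sorted.txt")
sort_records("file2.txt", "file2_sorted.txt")
# The two *_sorted.txt files can now be fed to any ordinary compare tool.
--- End code ---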



Back to the original problem:

My googling for "recognize moved text blocks" gave these hits, among others:

http://www.huffingtonpost.com/sandy-weiner/6-ways-to-recognize-and-s_b_2754804.html
= "6 Ways to Recognize And Stop Dating A Narcissist" (= hit number 43 or 44)
Now hold your breath: This got 1,250 comments! (= makes quite a difference from dull comp things and most threads here...)
Very funny article, by the way: laughed a lot!

http://www.velocityreviews.com/forums/t751417-difflib-like-library-supporting-moved-blocks-detection.html
So we are into Damerau-Levenshtein country here (not that I understand a mumbling word of this...), of "time penalties", and of "edit distance"
( http://en.wikipedia.org/wiki/Edit_distance )
(Perhaps, if I search long enough for "edit distance", I'll get back to another Sandy Weiner article? "Marriages come and go, but divorce is forever", she claims... well, there are exceptions to every rule, RIP Taylor/Burton as just ONE example here)
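For anybody who wants to actually see it: the "restricted" variant of Damerau-Levenshtein (a.k.a. optimal string alignment) is surprisingly short to write down. A sketch in Python, not tied to any of the tools above; it works on any sequences, so the "symbols" could just as well be whole lines as single characters:

--- Code: ---
# Sketch: optimal string alignment variant of Damerau-Levenshtein
# distance: insertions, deletions, substitutions and transpositions of
# adjacent symbols each cost 1.
def osa_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("ca", "ac"))  # 1: one adjacent transposition
--- End code ---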

Also of interest in this context, "Is there an algorithm that can match approximate blocks of text?" (well, this would be yet another problem):

http://digitalhumanities.org/answers/topic/is-there-an-algorithm-that-can-match-approximate-blocks-of-text

And also, this Shapira/Storer paper, "Edit Distance with Move Operations":

http://www.cs.brandeis.edu//~shapird/publications/JDAmoves.pdf

(A very complicated subject, of which I understand nothing except that my, call it "primitive", way definitely sidesteps this problem: equal-block elimination (!) first, THEN any remaining comparison.)



Now, not being a programmer (and thus not "knowing any better"), I would attack the problem like this, for whenever the setting for this special feature is "on":

- have a block separator code (= have the user enter it, so the algorithm knows what to search for)
- look for this code within file 1, build a list for the lines (= block 1 = lines 1 to 23, block 2 = lines 24 to 58, etc.)

for any block "a" in file 1 (see the list built up before):
- apply the regular / main (line-based) algorithm, FOR block "a" in file 1, TO every block in file 2 not yet marked (see next step)
- mark any of the "hit" blocks in file 2 as "do not show and process any further"
- this also means, no more processing of these file 2 blocks for further comparisons with further file 1 blocks
end for

- from file 2, show and process only those lines that have NOT already been discarded by the above
(for other non-programmers: all this is easily obtained by first duplicating the original file 2, then progressively deleting the "hit" parts of that duplicate)
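To make this concrete, a minimal Python sketch of the above; the file names and the form-feed separator "\f" are assumptions, and instead of calling a line-based algorithm per block pair it uses plain verbatim matching - which is exactly the "total identity or not" simplification from the EDIT further down:

--- Code: ---
# Sketch: split both files into blocks at a user-supplied separator,
# eliminate the blocks that occur verbatim in both files, and keep only
# the genuinely new/altered blocks for the regular line-based diff.
from collections import Counter

BLOCK_SEP = "\f"   # assumption: the user-entered block separator code

def blocks(path):
    with open(path, encoding="utf-8") as f:
        return f.read().split(BLOCK_SEP)

blocks1 = blocks("file1.txt")
blocks2 = blocks("file2.txt")

# How many copies of each block text in file 2 are still available for
# matching (= not yet marked "do not show and process any further").
available = Counter(blocks2)

only1 = []                       # blocks with no verbatim twin in file 2
for block in blocks1:
    if available[block] > 0:
        available[block] -= 1    # twin found: eliminate it on both sides
    else:
        only1.append(block)

only2 = []                       # file 2 blocks left over after elimination
leftover = available.copy()
for block in reversed(blocks2):
    if leftover[block] > 0:      # this copy was never matched by file 1
        leftover[block] -= 1
        only2.append(block)
only2.reverse()

# only1 / only2 now hold just the new or altered blocks; feed them to the
# ordinary line-based compare tool of your choice.
--- End code ---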

So where, good heavens, are the difficulties in such an option / sub-routine, which otherwise fine progs like BC (where users have been begging for this since 2005, and where the developers always say "it's on our wish list" - what?! a "wish list" for the developers, not for users? I'd always thought the developers' wish list was called a "road map"! Or who is supposed to do the work for these developers, then?!) or Araxis Merge (about 300 bucks, after all!) seem unable to realize?

Do I overlook important details here, and would all this really have to be done with Damerau-Levenshtein et al. instead?

Good heavens, even I could do this in AHK, triggering the basic algorithm, i.e. the "compare tool part", again and again for each text block, from within my replicate-then-delete-all-identical-parts-consecutively script! (Of course, in such a hybrid thing there would be a lot of time-consuming shuffling around, since the "compare tool part", triggered again and again, would have to be fed with just the separate text BITS from both files each time.)


At this moment I can only trial FREE tools, since I don't have time to do a complete image of my system (and I want to remain able to trial these tools later on, too), but it seems most tools do NOT do it (correctly). E.g., I just trialled WinMerge, which is said to do it with the appropriate setting ("Edit - Options - Compare - General - Enable moved block detection"), and with both yes and no it does NOT do it; and btw, it does not ask for the necessary block divider code character either...

Other possible candidates: XDiff, XMerge, some parts within Tortoise (a complicated system), ExamDiff Pro, and finally Diffuse, which is said to do it. For any of these, and for CompareIt! of course, the question remains whether they can handle only moved LINES, or really BLOCKS; and without a given, predetermined block identifier character, I think results will be as aleatory as they are in that so-much-praised WinMerge.

At the end of the day, it could be that there is NOT A SINGLE tool that does it correctly, and it's obvious many people over-hype (if such a thing is possible) free sw just because it's free (WinMerge, e.g., has navigation to first, previous and last difference, but lacks navigation to NEXT difference, as incredible as that might look...).

EDIT: Well, I should add two things that go without saying, but for the sake of a "complete concept": first, the algorithm "doing" all these blocks can of course be much simpler than the regular algorithm that afterwards does the remaining comparison (= time-saving); and second, we're only after "total identity or not" here, so this sub-routine can speed on to the next "record" as soon as it detects the tiniest dissimilarity. Both simplifications are NOT possible if I have to use a "regular", "standard" compare tool within a macro for this part of the script. (A sketch of such a cheap identity check follows right below.)
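Again just a sketch, building on the one above, to show how cheap that identity check can be; MD5 is merely used here as a convenient fingerprint, not as any tool's actual method:

--- Code: ---
# Sketch: reduce each block to a 16-byte digest, so the elimination pass
# compares short hashes instead of multi-line blocks; only the blocks
# that survive elimination need the full, regular comparison afterwards.
import hashlib
from collections import Counter

def digest(block):
    # MD5 is fine here: we only need a cheap "identical or not"
    # fingerprint, not a security property.
    return hashlib.md5(block.encode("utf-8")).digest()

def eliminate_identical(blocks1, blocks2):
    available = Counter(digest(b) for b in blocks2)
    survivors = []
    for block in blocks1:
        d = digest(block)
        if available[d] > 0:
            available[d] -= 1        # identical twin in file 2: drop both
        else:
            survivors.append(block)  # only these need the full diff
    return survivors

print(eliminate_identical(["alpha\nbeta", "gamma"], ["gamma", "delta"]))
# -> ['alpha\nbeta']
--- End code ---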

Ath:
Can you post examples of files that have some moved blocks, so we can do some tests ourselves?

evamaria:
Oops, that'd be difficult. Of course, my problem arises because I left behind my AC/DC converter AND my backup stick elsewhere, some weeks ago, so I had to work on less recent backups. Now I'm facing a 960-item, multi-MB outliner file and its dump into ".txt", ditto from May, ditto from July (from May), and from June, with heavy moves of items between May and June: so this is the (unshareable) real-life problem / file collection here. A dummy thing would be some work of an author within the public domain, with some chapters mixed up. I will try to find such a text, mix up its chapters, and try to upload the mixed-up version together with a link to the original version. The problem is to find something truly without any rights, in order not to "get problems"... ;-)

Ok, I found "wikisource.org", and then, a short story by E A Poe will do. Will try

Update: Just a poem of some 15 lines. Several other sites pretend to have a "copyright" on the texts there: it's obvious they inserted some typos into the texts they copied from elsewhere, anyway... So this is high-risk territory.

Does anybody know "safe" sites, where not only the classic authors, but the published texts, too, are within the public domain?

Ok, I found at last: gutenberg.org

Ok: the poems of Edgar Allan Poe, two times, once in the version gutenberg.org publishes them and once mixed up, both in plain .txt format; this second version is slightly larger since I deliberately inserted one bit from the original twice into the mixed-up version, in different places.

I did NOT scramble minor parts too, but only mixed up "complete entities", i.e. bits beginning with

       *       *       *       *       *

and then down to (and including) their last line above the next "       *       *       *       *       *".

Thus, with a "separator" like "       *       *       *       *       *" or just like "*", this would be a really difficult file comparison for most "differs".

wraith808:
There are many such potentially incredibly helpful threads there that are then (often very quickly) closed by some block warden. Given its potential, it could be the best site there is, but this invariable closing down of questions there is almost unbearable. The next step would be to even delete the existing answers when they close down a thread.

Here, the closing down is particularly "amusing", since "Rodolfo", to justify this hangman behavior, specifically asks for "another" question that in this particular instance HAS been asked, so total idiocy here is coupled with their recurrent, ordinary fascism. Excuse my very open words, but that could be the finest site on the web for programmers, yet it is constantly crippled by admins who are the worst a**ho*** on the web - and no, I never published a question or an answer there or tried to do so, but it's just too revolting, even just reading there.
-evamaria (July 14, 2013, 06:16 AM)
--- End quote ---

One of the reasons it (and other StackExchange sites) is so valued is that it sticks to a particular vision and is excellent within that vision.  So calling someone a 'fascist' for moderating toward that vision seems a bit... extreme.  Especially when the only difference between you and the people who are voting is time spent building the site by actively adding content (questions and answers).

And I say this as someone who would have voted to close such a question had I the reputation on that SE.
