EDIT: Sorry, Shades, I had overlooked your first post mentioning ExamDiff Pro; indeed, it's on my list of tools to trial, like the few others mentioned below.
Thank you, Shades, for sharing your experience. In fact, I'm heavily googling for such db dump-and-compare tools, and there are several that bear two price tags: structure only, and structure plus content. So I did think about dumps, but such a dump would then be nothing more than what I already have with my text files and their item = "record" divider code character: those "records" are in total "disorder" from any text compare tool's point of view, so putting my things into a db (which will be a problem in itself), then extracting it dump-wise, will not get me many steps forward!
"Not many steps", instead of "no step", because, in fact, there IS some step in the right direction here: In fact, many text compare tools and programmable editors and such allow for SORTING LINES, but as you will have understood, my items ain't but just single lines, but bulks of lines, so any "sort line, then compare" idea is to be discarded.
And here, a db and its records would indeed help, since in a db, you can sort records, not just sort lines, and that's why...
EUREKA for my specific problem at least:
- I can import my "text dump" (= text file, from my 1,000 "items", divided by a code character into "records") from my file 1, into an askSam file.
- ditto for my file 2.
- I then sort both AS files by first line of their records.
- I then export both AS files into respective text files.
- I can then import those NEW text files into ANY text compare tool (e.g. BC, which I "know"), and then only some 10 or so new or altered items would be shown, not 200 moved ones.
- The same would be possible with any sql db and then its sorted records, dumped into text files, then compared by any tool.
- In any case, this is a lot of fiddling around (and hoping that buggy AS will not mix up things but will work according to its claims); the sort-and-export part could also be scripted, see the sketch below.
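Not a feature of any of those tools, just a minimal sketch of that sort-then-export idea in Python, assuming the "records" are separated by a single divider code character (the "\x1d" below is purely an assumption; use whatever character your files actually use):

# Split two files into records by a divider character, sort the records
# by their first line, and write the sorted versions back out so that
# any ordinary line-based compare tool can take over from there.
DIVIDER = "\x1d"

def sorted_records(path):
    with open(path, encoding="utf-8") as f:
        records = [r for r in f.read().split(DIVIDER) if r.strip()]
    # sort by the first non-empty line of each record
    return sorted(records, key=lambda r: r.strip().splitlines()[0])

for src, dst in [("file1.txt", "file1_sorted.txt"),
                 ("file2.txt", "file2_sorted.txt")]:
    with open(dst, "w", encoding="utf-8") as out:
        out.write(DIVIDER.join(sorted_records(src)))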
Back to the original problem:
My googling for "recognize moved text blocks" gave these hits, among others:
http://www.huffingto...and-s_b_2754804.html = "6 Ways to Recognize And Stop Dating A Narcissist" (= hit number 43 or 44)
Now hold your breath: This got 1,250 comments! (= makes quite a difference from dull comp things and most threads here...)
Very funny article, by the way: laughed a lot!
http://www.velocityr...locks-detection.html
So we are into Damerau-Levenshtein country here (not that I understand a mumbling word of this...), of "time penalties", and of "edit distance"
(http://en.wikipedia....g/wiki/Edit_distance)
(Perhaps, if I search long enough for "edit distance", I'll get back to another Sandy Weiner article? "Marriages come and go, but divorce is forever", she claims... well, there are exceptions to every rule, RIP Taylor/Burton as just ONE example here)
Also of interest in this context, "Is there an algorithm that can match approximate blocks of text?" (well, this would be another problem yet):
http://digitalhumani...imate-blocks-of-text
And also, this Shapira/Storer paper, "Edit Distance with Move Operations":
http://www.cs.brande...cations/JDAmoves.pdf
(A very complicated subject which I don't understand, but which my, call it "primitive", way definitely sidesteps: equal-block elimination (!) first, THEN any remaining comparison.)
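For anybody as lost as I am with these terms: plain "edit distance" (Levenshtein) just counts the minimum number of single-character insertions, deletions and substitutions to turn one string into another; Damerau-Levenshtein adds transpositions, and the Shapira/Storer paper adds whole-block moves on top. A tiny sketch of the plain version, only to show what is being counted:

# Plain Levenshtein edit distance between two strings: the minimum number
# of single-character insertions, deletions and substitutions.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # prints 3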
Now, not being a programmer (and thus, not knowing "any better"), I would attack the problem like this, for whenever the setting for this speciality is "on":
- have a block separator code (= have the user enter it, then the algorithm knows what to search for)
- look for this code within file 1, and build a list of the blocks and their line ranges (= block 1 = lines 1 to 23, block 2 = lines 24 to 58, etc.)
for any block "a" in file 1 (see the list built up before):
- apply the regular / main (line-based) algorithm, FOR block "a" in file 1, TO any of the blocks 2 to xyz in file 2
- mark any of the "hit" blocks in file 2 as "do not show and process any further"
- this also means, no more processing of these file 2 blocks for further comparisons with further file 1 blocks
end for
- from file 2, show and process only those lines that have NOT already been discarded by the above
(for other non-programmers: all this is easily obtained by first duplicating the original file 2, then just deleting the "hit" parts of that duplicate, progressively)
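Not claiming this is how BC or anyone else should do it internally, just a sketch of the above in Python, again assuming a single divider code character ("\x1d" is an assumption) and exact, whole-block matches:

# "Equal-block elimination": every block of file 1 that occurs verbatim in
# file 2 is removed from both sides (each match consumed only once); what
# remains is written out and handed to the regular line-based comparison.
DIVIDER = "\x1d"

def blocks(path):
    with open(path, encoding="utf-8") as f:
        return [b for b in f.read().split(DIVIDER) if b.strip()]

blocks1 = blocks("file1.txt")
remaining2 = blocks("file2.txt")     # the "duplicate of file 2"
remaining1 = []

for a in blocks1:
    if a in remaining2:
        remaining2.remove(a)         # "hit": drop it, never reuse it
    else:
        remaining1.append(a)         # survives for the real comparison

with open("file1_rest.txt", "w", encoding="utf-8") as f:
    f.write(DIVIDER.join(remaining1))
with open("file2_rest.txt", "w", encoding="utf-8") as f:
    f.write(DIVIDER.join(remaining2))
# file1_rest.txt / file2_rest.txt now contain only the genuinely new,
# deleted or altered blocks; feed those to BC, WinMerge or whatever.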
So where, good heavens, are the difficulties in such an option / sub-routine, which otherwise fine progs like BC (where users have been begging for it since 2005, and where the developers always say "it's on our wish list" - what?! a "wish list", not for users, but for the developers? I'd always thought their wish list was called a "road map"! or who's supposed to do the work for these developers, then?!) or Araxis Merge (about 300 bucks, mind you!) seem unable to realize?
Do I overlook important details here, and would all this really have to be done via Damerau-Levenshtein et al. instead?
Good heavens, even I could do this, in AHK, triggering the basic algorithm, i.e. the "compare tool part", again and again, for each text block, from within my replicate-then-delete-all-identical-parts-consecutively script! (Of course, in such a hybrid thing there would be a lot of time-consuming shuffling around, since the "compare tool part", triggered again and again, would have to be fed with just the separate text BITS from both files each time.)
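Not AHK, and "fc" below is only a stand-in for whatever command line your own compare tool offers, but here's a sketch of what that hybrid shuffling would look like: write each pair of blocks to temporary files and fire the external tool at them, block pair by block pair:

# Hybrid approach: the external compare tool is invoked once per block
# pair, on two temporary files holding just those text bits.
import os
import subprocess
import tempfile

def blocks_are_identical(block_a, block_b):
    paths = []
    for text in (block_a, block_b):
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    try:
        # "fc" (Windows file compare) returns 0 when the files are equal;
        # substitute the CLI of the compare tool you actually use.
        return subprocess.call(["fc", *paths]) == 0
    finally:
        for p in paths:
            os.remove(p)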
At this moment, I can only trial FREE tools, since I don't have time to do a complete image of my system (and I want to stay able to trial these tools later on, too), but it seems most tools do NOT do it (correctly): e.g., I just trialled WinMerge, which is said to do it, with the appropriate setting ("Edit - Options - Compare - General - Enable moved block detection"), and with that setting both on and off, it does NOT do it; and btw, it does not ask for the necessary block divider code character either...
Other possible candidates: XDiff, XMerge, some parts within Tortoise (complicated system), ExamDiff Pro, and finally Diffuse, which is said to do it - for any of these, and CompareIt! of course, the question remains whether they can handle just moved LINES, or really BLOCKS, and without a given, predetermined block identifier character, I think results will be as aleatoric as they are in that so-much-praised WinMerge.
At the end of the day, it could be that there is NOT A SINGLE tool that does it correctly, and it's obvious many people over-hype (if such a thing is possible) free sw just because it's free (WinMerge, e.g. has navigation to first, prev and last difference, but lacks navigation to next difference, as incredible as that might look...).
EDIT: Well, I should add two things that go without saying, but for the sake of a "complete concept": first, the algorithm "doing" all these blocks can of course be much simpler than the regular algorithm that afterwards does the remaining comparison (= time-saving); and second, we're only into "total identity or not" here, so this sub-routine can speed on to the next "record" as soon as it detects the tiniest dissimilarity. Both simplifications are NOT possible if I have to use a "regular", "standard" compare tool within a macro for this part of the script.
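Just to make those two simplifications concrete (again only a sketch, refining the elimination pass above, nothing any existing tool is claimed to do): since only "totally identical or not" matters for the block pass, each file-2 block can be reduced to one cheap hash, and every file-1 block then costs a single lookup instead of a full comparison against every remaining file-2 block:

# The "total identity or not" shortcut: hash each block once; matching then
# needs no line-by-line work at all, and mismatching blocks are simply left
# for the regular, slower comparison afterwards.
import hashlib
from collections import Counter

def fingerprint(block):
    return hashlib.sha1(block.encode("utf-8")).hexdigest()

def eliminate_identical(blocks1, blocks2):
    pool = Counter(fingerprint(b) for b in blocks2)   # unmatched file-2 blocks
    rest1 = []
    for b in blocks1:
        h = fingerprint(b)
        if pool[h]:
            pool[h] -= 1              # identical twin found: consume it
        else:
            rest1.append(b)           # keep for the regular comparison
    consumed = Counter(fingerprint(b) for b in blocks2) - pool
    rest2 = []
    for b in blocks2:
        h = fingerprint(b)
        if consumed[h]:
            consumed[h] -= 1          # this copy was matched: drop it
        else:
            rest2.append(b)
    return rest1, rest2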