General Software Discussion / Re: The problem with text compare tools - similar, in database compare tools
« on: July 15, 2013, 11:08 AM »
See simultaneous, previous post!
Programming solution of the original problem
We have to distinguish between "Moved Lines"
from what "they" say, these programs can process them "properly" (this notion will have to be discussed later):
ExamDiff Pro, Code Compare, Meld, WinDiff, WinMerge, UCC, Compare Suite (?), XDiff/XMerge
As said, I tried WinMerge to no avail, but that was with blocks; perhaps it works with just single lines (and see below)...
and "Moved Blocks"
Here, we can assume that no tool will be able to process these properly if it cannot process moved lines to begin with, so this group should be a sub-group of the above.
Also, there might be "Special, recognizable blocks"
That is, some tools try to recognize the programming language used in your text, and could then try to "get" such blocks (or whatever these tools understand by this term) and recognize them when moved... whilst the same tools could fail completely where normal "body text" is concerned.
WinMerge is perhaps an example of this distinction, but then it would be helpful if the developers told us something about it in the help file; as it is, WinMerge did NOT recognize moved blocks in my trial.
The Copy vs. Original problem
In my post above, I wondered whether I had overlooked something important: the solution I presented is easy to code, so why is proper moved-block processing absent from almost all (or perhaps all) relevant tools? I know now:
In fact, my intermediate solution relies on working on at least one copy of the original file, on the side where moved blocks are detected and then deleted in order to "straighten out" the rest of the file for any "non-moved-blocks" comparison.
On the other hand, all (?) current "differs" use the original files, and (hopefully) let you edit them (both).
The file copying could be automated, i.e. the user selects the original, then the tool creates a copy on which it will then work, so this is not the problem; also, editing on a copy could then be replicated on the original file, but that's not so easy to code anymore.
Now, we have to differentiate between what you see within the panes and what the tool is "working" on; or rather, let's assume that the tool holds, in memory, two versions of each of the two files: one with the actual content, the other for what you see on the screen.
Now, I said the tool should delete all moved blocks (in one of the two files, not in both). Yes, it could do this within an additional, intermediate copy, just to speed up any block compare later on, OR it could do these compares on parts of the actual file, reading from a table/array/list the respective lines to compare the actual block to:
- first, write the block table (array a) for file 1 (by the BlockBeginCode, e.g. Alt 0165): block 1 = lines 1 to 17, block 2 = lines 18 to 48, and so on up to EOF (end of file); also, write the very first chars of each block, perhaps char 1 to char 10 (in our block 2 example: chars 1 to 10 of line 18 of file 1), into the array
- then, write the block table for file 2 (array b), in the same way
- then, FOR EVERY block in file 1 (= for block 1 to block xxx, detailed in the corresponding line x in array a):
-- run through array b (= the one for file 2), from line 1 to line yyy, meaning: check, for every line y in array b, whether the first characters of block y (= the characters from the line "startnumber of block y", put into array b earlier) are identical with those read from line x in array a
-- and only IF they are identical:
--- on the first hit for this x in file 2, read the lines of block x of file 1 into buffer "a" (for non-programmers, that's a file existing only within working memory), and in any case:
--- put all lines of this block y of file 2 into a buffer "b",
--- and run a subroutine checking if the respective contents of buffer "a" and of buffer "b" are identical (in practice, you would cut up this subroutine into several ones, first checking the whole first line, or better, the whole first 50 chars, etc., with breaking the whole subroutine on first difference)
--- and IF there is identity between two buffers, the routine would not delete the corresponding block y, but put a "marker" into both line x of array a and line y of array b
-- and just to make this clear, even if there is identity, for this x block, all remaining y blocks will be checked notwithstanding, since it's possible that the user has copied a block, perhaps several times, instead of just moving it
END FOR
- then, the routine would have unchanged files 1 and 2, in its buffers 1 and 2, respectively, but would have all the necessary processing information within its arrays a and b
- it would then create, in buffers 1a and 2a, files 1a and 2a, the respective "display" files on which to work from then on
- then, it would not only process both display panes, according to the array info, but also the display of the "changes ribbon" (which most "differs" also display):
- the program then checks for all those "markers" in array a, and will display just a line, instead of the block, saying something like "Block of n lines; has been moved in file 2", or "Block of n lines; was moved/copied 3 times in file 2"
- similar then for all "block markers" in array b: the display just shows a line, saying "Block moved/copied here"
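The steps above can be sketched roughly as follows, in Python. This is only a minimal illustration under assumptions of mine: the block-begin code is taken to be the character that Alt 0165 produces on a Western Windows code page (¥), it is expected at the start of a line, and the names `Block`, `build_block_table` and `mark_moved_blocks` are hypothetical. Like the description, the sketch marks every pair of identical blocks, including blocks that did not change position; telling those apart would need an extra check.

```python
from dataclasses import dataclass, field

BLOCK_BEGIN = "\u00a5"  # assumed block-begin code (Alt 0165)
PREFIX_LEN = 10         # "the very first chars" stored per block


@dataclass
class Block:
    start: int                 # first line of the block, 1-based
    end: int                   # last line of the block, inclusive
    prefix: str                # first chars of the block's first line
    matches: list = field(default_factory=list)  # the "markers"


def build_block_table(lines):
    """Build array a / array b: one entry per block of the file."""
    table = []
    start = 1
    for n, line in enumerate(lines, start=1):
        # A marker line opens a new block and closes the previous one.
        if line.startswith(BLOCK_BEGIN) and n > start:
            table.append(Block(start, n - 1, lines[start - 1][:PREFIX_LEN]))
            start = n
    if lines:
        table.append(Block(start, len(lines), lines[start - 1][:PREFIX_LEN]))
    return table


def mark_moved_blocks(lines1, lines2):
    """Mark identical blocks in both tables; the files stay unchanged."""
    a = build_block_table(lines1)
    b = build_block_table(lines2)
    for x, block_x in enumerate(a):
        buffer_a = None
        for y, block_y in enumerate(b):
            # Cheap test first: compare only the stored first chars.
            if block_x.prefix != block_y.prefix:
                continue
            # Read block x only on the first hit for this x.
            if buffer_a is None:
                buffer_a = lines1[block_x.start - 1:block_x.end]
            buffer_b = lines2[block_y.start - 1:block_y.end]
            # Full comparison; list equality stops at the first difference.
            if buffer_a == buffer_b:
                block_x.matches.append(y)
                block_y.matches.append(x)
            # No break here: the block may have been copied several times.
    return a, b
```

After `a, b = mark_moved_blocks(lines1, lines2)`, a block in `a` with more than one entry in `matches` corresponds to the "was moved/copied n times" case, and each marked block in `b` would be shown as a single "Block moved/copied here" line.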
That's all there is to it. Bear in mind the program will constantly access both arrays (which in reality are more complicated than described above), and thus will "know", e.g., that this specific "Block moved" line somewhere in pane 2 is line 814 of buffer 2a, and in reality represents lines 2,812 to 2,933 of buffer 2 and of actual file 2. So, if the tool is coded in a smart way, it would even be possible to move that "Block moved" line manually (= by mouse or by arrow key combination) on screen, to let's say line 745 of screen buffer 2a; the program would then properly move lines 2,812 to 2,933 within buffer 2, to right after the line in buffer 2 which corresponds to line 745 in buffer 2a.
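The bookkeeping between the display buffer and the real buffer could look something like the following Python sketch. It is an illustration under my own assumptions, not a description of how any existing "differ" works: each display entry simply remembers which real lines it stands for, and the names `build_display`, `move_block` and `move_display_entry` are hypothetical.

```python
def build_display(lines, collapsed):
    """Build the display buffer from the real buffer.

    `collapsed` holds (start, end) ranges of real lines, 1-based and
    inclusive, that are shown as one placeholder line each.
    Every display entry is (text, real_start, real_end).
    """
    display = []
    ranges = iter(sorted(collapsed))
    current = next(ranges, None)
    n = 1
    while n <= len(lines):
        if current is not None and n == current[0]:
            start, end = current
            text = f"[Block of {end - start + 1} lines moved/copied here]"
            display.append((text, start, end))
            n = end + 1
            current = next(ranges, None)
        else:
            display.append((lines[n - 1], n, n))
            n += 1
    return display


def move_block(lines, start, end, after):
    """Return a new real buffer with lines start..end moved so that
    they follow real line `after` (0 means: move to the very top)."""
    if start <= after <= end:
        raise ValueError("target line lies inside the block itself")
    block = lines[start - 1:end]
    rest = lines[:start - 1] + lines[end:]
    # Lines below the removed block shift up by the block's length.
    insert_at = after if after < start else after - len(block)
    return rest[:insert_at] + block + rest[insert_at:]


def move_display_entry(lines, display, src, dst):
    """Move the real lines behind display entry `src` to right after
    the real lines behind display entry `dst` (both 0-based)."""
    _, start, end = display[src]
    _, _, after = display[dst]
    return move_block(lines, start, end, after)
```

The display buffer would be rebuilt from the arrays after each such move, so that the line numbers stored in the display entries stay in step with the real buffer.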
As you can see, there is no real problem whatsoever in implementing my way of "moved-blocks processing" into existing "differs"; it's just a little bit more info to be put into those arrays that are already there anyway, in order for all those "other differences" and their encodings to be stored after checking for them, the bread-and-butter tasks of any "differ" out there.
Hence, again, my question: where's the tremendous difficulty I don't see here, and which hinders developers of tools like BC, etc., from implementing this feature we all crave?