The problem with text compare tools - similar, in database compare tools

evamaria:
(Originally, a rant about the Snowden affair stood here.)

wraith808:
trap? accounts closing down?

...O...k...

:o :huh:

I was just speaking against calling someone a fascist for modding a board. Sort of lessens the use of the word and is a pejorative that's really not needed.
* wraith808 backs away slowly...

Tinman57:

I always use CsDiff and VData, which are both free. I use them to compare files between hard drives and CD/DVD's to make sure the files are the same.....

evamaria:
wraith808, you are perfectly right, and of course there are BIG differences, and it's not true that "with such character traits, you could run a concentration camp, too", as some German politician once said about some adversary. It's just that I have a problem with closing down threads (over there) that undeniably contain (sometimes highly) constructive content for being "not constructive", AND I have a VERY BIG problem with what we've seen from ALL our European governments these last weeks, when at least two thirds of our population(s? I can't speak so much for countries other than Germany) would very much have liked to grant asylum to Edward Snowden, and when all we see is that EVERY government out there would like to "get him", in order to make a package out of him, straight to Obama, the Nobel prize winner. So I mixed it all up here; I will try to move my political views to a second thread, in order to clean this up as far as possible. To do something constructive:

(Tinman57, you're not speaking of proper moved-blocks processing, which is the subject here?) ;-)

So, problem and askSam solution

As said, I hadn't got my newest outliner files at hand; I only had backups, and not the newest ones.

I assumed they were sufficiently recent - I was wrong for two big files; but I was smart enough, at least, to work on copies, not on the old originals, and within a specific, new folder to which I copied the files I needed; it would have been much smarter, though (see special prob below), to classify these copies as "read-only", since for any new stuff, I had created a new file "NEW" anyway.

Now, my file "c" (= in fact, my "Computer-related things" "Inbox" (!), always much too big) is in 3 versions:
- May (964 items, the old backup)
- MayJuly (964 items on start, a copy of "May" I worked on in July, and 964 items in the end = pure coincidence)
- June (497 items, the current file before doing that mess in MayJuly (and after having moved out, in June, many items to other files))

Special problem: I had mistakenly assumed the old May backup was rather recent (or rather, after some days, I simply forgot it was as old as it was...), so I moved lots of items within the MayJuly file, which of course should have been avoided, so I worsened my original problem considerably.

Task: Identify any item in the MayJuly file that has been altered, added or deleted compared with the May version, and replicate these changes / additions within the June file, while taking care of the aforementioned special problem, i.e. the order / position of many items having been mixed up between May and MayJuly.

My outliner allows for exporting items (title and content) as plain text, and for separating items by a code sign before each title; it does NOT allow for exporting items in a flattened-out, then sorted "list" form. (Remark: Some (especially db-based) outliners will allow for identifying changed/added items within the application itself, so you will not need external tools in such cases, but we try to find a solution for any case where this easy way out is not possible, i.e. in "lesser" outliners, in editors, or in any text processing application.)

Solution:

In your outliner or text file:

- Do the plain text export, for May and for MayJuly file, with a Yen sign (Alt 0165) before each item.

In askSam:

- Open a "blank" file in AS (if you don't have a license, the trial version will do)

- "File-Import", then "after the last item" (there is an AS header item there, so after import, there should be 1 plus 964 = 965 items). Other settings there: "Document delimiter: STRING", then Alt 0165 or whatever; you can also check the box "remove string". File origin is Win Ansi (default). The important thing here is that for every new import, you have to enter the "string: alt..." anew: it's AS, there's nothing intuitive here... After import, you'll get a message: "Imported: 964 items" (and so, in this case, the AS file comprises 965 records).

- Do the same for the second file, i.e. open a second blank file in AS, and import your second .txt file here (as said, pay attention to enter the string delimiter again)

Then in AS, for the first file, then for the second one:

- If my description does not suffice, there is a subject "Saving a file in sorted order" in the AS help. So:

- Do not do anything along the lines of "Actions-Sort", but:

- Do the menu "File"-"Export"-"Select Documents" (= NOT "entire file")
- Click button "Clear All" if not greyed out
- Click button "Sort : None" (sic - we're in AS country here..., and don't click "Help": it's NOT context-sensitive...)

- Now, in the "Sort" dialog, click in the grey field beneath "Sort on", then select "first line in document" (this will also show "Type:text" and "Order:ascending")
- Click OK

- This will bring you back to the "Search" dialog, where it always shows "There are no items to show" (= not intuitive, rather deterring, but it's simply the info that your previous "ok" just made settings, but didn't trigger the actual sort yet), but where the button changed to "Sort Defined"
- Click OK here, again (Don't click on "Clear All" now, since the Sort button would revert to "Sort: None", of course)
- This brings the "Export" dialog, finally, into which you'll enter the target file name, and again "Ok"

- You'll get the message "Exported: 965" (or whatever: number of your items plus the one "header" item exported, too)

You'll do all this for both files, then import them into the "differ" of your choice.
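For those without AS at hand, the same "split by delimiter, sort by first line, export" step can be sketched in a few lines of Python. This is only a minimal sketch, not what AS does internally; the function names are made up for illustration, and it assumes the Yen sign is the record delimiter and does not appear elsewhere in the text.

```python
DELIMITER = "\u00a5"  # Yen sign (Alt 0165), assumed to mark the start of each item


def split_records(text, delimiter=DELIMITER):
    """Split an exported text into items; empty fragments are dropped."""
    return [part.strip("\n") for part in text.split(delimiter) if part.strip()]


def first_line(record):
    """Return the title line of an item (everything up to the first newline)."""
    return record.split("\n", 1)[0]


def sort_by_first_line(text, delimiter=DELIMITER):
    """Return the export with its items sorted by their first line only."""
    records = sorted(split_records(text, delimiter), key=first_line)
    return "\n".join(delimiter + record for record in records) + "\n"


sample = "\u00a5Zebra\nbody z\n\u00a5Apple\nbody a\n\u00a5Mango\nbody m\n"
print(sort_by_first_line(sample))
```

Python's sort is stable, so items with identical first lines simply keep their original relative order.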

Now, perhaps a new problem arises: AS will have sorted your items just by the very first line of them, which is, by the outliner export above, the title you gave to your items within the outliner; similar if you exported from within an editor: first line.

The problem in my case: I often title items "again", or "ditto" or such, for indented (sub-) items, and here, a "sorting tool" like AS creates chaos, since there is no further sub-sorting by line 2, 3. In fact, I've got dozens of such "dittos" among these, so my "sorted" .txt files didn't work any better in my "differ" than the original ones, and I needed several hours to work it out "manually" - it would have been smarter to number all my dittos first, then sort, then take those unwanted numbers away again... - This problem also shows the importance of sorting QUALITY, i.e. of the need, for a good sort, to not do it by just one criterion but by several. (AS is not really to blame here, since if you sort by field content there, it lets you combine several fields.)
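To illustrate the point about sorting by several criteria, here is a small Python sketch, assuming the same Yen-delimited export; the helper names are hypothetical. It sorts by title first and by the item's body second, so that identically titled "ditto" items end up in the same order in both files regardless of where they stood before.

```python
def split_records(text, delimiter="\u00a5"):
    """Split an exported text into items; empty fragments are dropped."""
    return [part.strip("\n") for part in text.split(delimiter) if part.strip()]


def title_only(record):
    """Single-criterion key: just the first line."""
    return record.partition("\n")[0]


def title_then_body(record):
    """Two-criterion key: first line, then everything after it."""
    title, _, body = record.partition("\n")
    return (title, body)


may = "\u00a5ditto\nsecond note\n\u00a5Backup\nplan\n\u00a5ditto\nfirst note\n"
mayjuly = "\u00a5ditto\nfirst note\n\u00a5ditto\nsecond note\n\u00a5Backup\nplan\n"

# Sorting on the title alone leaves the two "ditto" items in different orders.
same_by_title = (sorted(split_records(may), key=title_only)
                 == sorted(split_records(mayjuly), key=title_only))

# Adding the body as a second criterion makes both sorted lists identical.
same_by_both = (sorted(split_records(may), key=title_then_body)
                == sorted(split_records(mayjuly), key=title_then_body))

print(same_by_title, same_by_both)
```

This only helps for items whose bodies are unchanged; an edited "ditto" item will still sort to a different place, which is where numbering the titles beforehand would come in.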

But then, in a programming environment, you will probably give "expressive" names to your routines, etc., so that you will not have identically-named "items" to sort. As said, in case of necessity, do it with commented-out lines before the beginning of each routine, etc., and use the special comment character you use there as the record divider. By "special comment character", I mean the ordinary comment character plus a second character, e.g. a ";" plus a second ";" (whilst for ordinary comments, you'll just use the single ";" character), since AS accepts strings as "record code". If you do something similar with a tool that only accepts ONE such "record code" character, you will have to do the same within your code, i.e. use the regular comment character plus a second special character (double the comment character, or anything else); but before export from your editor, or before import into the sorting tool, you'll have to replace the "double comment char" with a single special character, since you would certainly not want the sorting tool to separate ordinary comments from code because it "thinks" there is a "new record code".
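A minimal Python sketch of that idea, assuming ";" is the ordinary comment character and a line starting with ";;" opens a new record; the function name and the sample "code" are invented for illustration. Ordinary single-";" comments stay inside their record.

```python
def split_code_blocks(source, marker=";;"):
    """Split source text into records; a line starting with `marker` opens a new one."""
    blocks, current = [], []
    for line in source.splitlines():
        if line.startswith(marker) and current:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


source = (
    ";; 002 save routine\n"
    "; write to disk\n"
    "save()\n"
    ";; 001 load routine\n"
    "load()  ; read from disk\n"
)

# Sorting the records orders them by their (numbered) header comment line.
for block in sorted(split_code_blocks(source)):
    print(block)
    print("-----")
```

Any lines before the first ";;" header form a record of their own, so a file preamble is not lost.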

So this is a viable solution in case we do not discover a "differ" doing "block identification" in a really reliable way. In my case, the only problems I encountered arose from my identical naming of different items, hence the need to make the very first lines of your code entities distinct. If those first lines are code, they will be distinct automatically, but then placing the "new record code" will be difficult; so the first line will be a comment line, and some numbering there will perhaps provide the necessary distinction. In most outliners, you will be able to put the special character before your items afterwards (as seen above), but even in an editor or such, in order to properly "fold" your code by various criteria, properly "encoded" comment lines starting each code block will certainly be a good idea: first the comment char, then some special characters by which you will fold your code in different ways, then any "real comment".

This reminds me: Some writer (I believe it was a novelist) mentioned somewhere that a programmer friend of his had tweaked KEdit for him to write all his stuff within just this editor (which I explained elsewhere in this forum just some months ago since it's one of the best (and weirdest, yes) you can get), and I'm sure that writer does use such "code chars", too, e.g. for chapters, subchapters, etc., and even paragraphs to be refined, etc., in order to "fold" by them (and no, he didn't explain his system any further, so I don't see the need for searching the relevant links).

evamaria:
See simultaneous, previous post!

Programming solution of the original problem

We have to distinguish between "Moved Lines"

from what "they" say, these programs can process them "properly" (this notion will have to be discussed later):

ExamDiff Pro, Code Compare, Meld, WinDiff, WinMerge, UCC, Compare Suite (?), XDiff/XMerge

As said, I tried WinMerge to no avail, but that was with blocks, so perhaps it works with just single lines (and see below)...

and "Moved Blocks"

here, we can assume that no tool will be able to process these properly if it isn't able to process moved lines, to begin with, so this group should be a sub-group of the above.

Also, there might be "Special, recognizable blocks"

Which means, some tools try to recognize the used programming language in your text, and then, they could try to "get" such blocks or whatever such tools then understand by this term, and recognize them when moved... whilst the same tools could completely fail whenever normal "body text" is concerned.

In the case of WinMerge, perhaps this tool is an example of this distinction-to-be-made, but then, it would be helpful if the developers told us something about it in the help file; as it is, WinMerge does NOT recognize moved blocks in my trial.

The Copy vs. Original problem

In my post above, I mused if I had overlooked something important, since the solution I presented, is easy to code, so why the absence of proper moved-block processing in almost all (or perhaps all) relevant tools? I know now:

In fact, my intermediate solution relies on working on at least one copy of the original file, on the side where moved blocks are detected and then deleted, in order to "straighten out" the rest of the file for any "non-moved-blocks" comparison.

On the other hand, all (?) current "differs" use the original files, and (hopefully) let you edit them (both).

The file copying could be automated, i.e. the user selects the original, then the tool creates a copy on which it will then work, so this is not the problem; also, editing on a copy could then be replicated on the original file, but that's not so easy to code anymore.

Now, we have to differentiate between what you see within the panes, and what the tool is "working" on, or rather, let's assume the tool, for both files each, will process, in memory, two files, one with the actual content, the other for what you see on the screen.

Now, I said, the tool should delete all moved blocks (in one of the two files, not in both): Yes, it could do this within an additional, intermediate copy, just for speeding up any block compare later on, OR it could do these compares on parts of the actual file, reading, from a table/array/list, the respective lines to compare the actual block to:

- first, write the block table (array a) for file 1 (by the BlockBeginCode, e.g. Alt 0165): block 1 = lines 1 to 17, block 2 = lines 18 to 48, and so on up to EOF (end of file); also, write the very first chars, perhaps from char 1 to char 10, of the block (our block 2 example: chars 1 to 10 of line 18 of file 1) into the array
- then, write the block table for file 2 (array b), in the same way

- then, FOR EVERY block in file 1 (= for block 1 to block xxx, detailed in the corresponding line x in array a):
-- run array b (= the one for file 2), from line 1 to line yyy, meaning: check, for any line in array b, if the first characters of block y (= the characters from the line "startnumber of block y" in array b, put into the array earlier) are identical with those read from line x in array a
-- and only IF they are identical:
--- on the first hit for this x in file 2, read the text of the lines of block x of file 1 into buffer "a" (for non-programmers, that's a file just existing within working memory), and in any case:
--- put all lines of this block y of file 2 into a buffer "b",
--- and run a subroutine checking if the respective contents of buffer "a" and of buffer "b" are identical (in practice, you would cut up this subroutine into several ones, first checking the whole first line, or better, the whole first 50 chars, etc., with breaking the whole subroutine on first difference)
--- and IF there is identity between two buffers, the routine would not delete the corresponding block y, but put a "marker" into both line x of array a and line y of array b
-- and just to make this clear, even if there is identity, for this x block, all remaining y blocks will be checked notwithstanding, since it's possible that the user has copied a block, perhaps several times, instead of just moving it
END FOR
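The loop above could be sketched in Python roughly as follows. This is a minimal sketch under my own assumptions, not the code of any existing "differ": the names (Block, build_block_table, mark_identical_blocks) are invented, each file is given as a list of lines, the Yen sign stands in for the BlockBeginCode, and a "marker" is simply the index of the identical block in the other file.

```python
from dataclasses import dataclass, field

MARK = "\u00a5"   # assumed BlockBeginCode (Yen sign, Alt 0165)
PREFIX_LEN = 10   # number of leading chars stored in the block table


@dataclass
class Block:
    start: int    # first line of the block (1-based, inclusive)
    end: int      # last line of the block (1-based, inclusive)
    prefix: str   # first chars of the block's first line
    matches: list = field(default_factory=list)  # indices of identical blocks in the other file


def build_block_table(lines, mark=MARK):
    """Write the block table (array a or b) for one file."""
    table = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(mark) or not table:
            table.append(Block(number, number, line[:PREFIX_LEN]))
        else:
            table[-1].end = number
    return table


def block_lines(lines, block):
    """Read the lines of one block into a buffer."""
    return lines[block.start - 1:block.end]


def mark_identical_blocks(lines1, lines2):
    """For every block x of file 1, check ALL blocks y of file 2 and mark identical ones."""
    table_a = build_block_table(lines1)
    table_b = build_block_table(lines2)
    for x, a in enumerate(table_a):
        buffer_a = None
        for y, b in enumerate(table_b):
            if a.prefix != b.prefix:
                continue                      # cheap check on the stored first chars
            if buffer_a is None:
                buffer_a = block_lines(lines1, a)   # filled on first hit only
            if buffer_a == block_lines(lines2, b):  # full comparison of the two buffers
                a.matches.append(y)
                b.matches.append(x)
    return table_a, table_b


file1 = ["\u00a5A", "a1", "\u00a5B", "b1", "b2", "\u00a5C", "c1"]
file2 = ["\u00a5C", "c1", "\u00a5A", "a1 changed",
         "\u00a5B", "b1", "b2", "\u00a5B", "b1", "b2"]
table_a, table_b = mark_identical_blocks(file1, file2)
print([blk.matches for blk in table_a])
print([blk.matches for blk in table_b])
```

Note that the sketch marks every identical pair of blocks, including blocks that never changed position; telling "moved" apart from "identical and still in place" would need an additional comparison of positions.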

- then, the routine would have unchanged files 1 and 2, in its buffers 1 and 2, respectively, but would have all the necessary processing information within its arrays a and b
- it would then create, in buffers 1a and 2a, files 1a and 2a, the respective "display" files on which to work

- then, it would not only process both display panes, according to the array info, but also the display of the "changes ribbon" (which most "differs" also display):
- the prog then checks for all those "markers" in array a, and will display just a line, instead of the block, saying something like "Block of n lines; has been moved in file 2", or "Block of n lines; was moved/copied 3 times in file 2"
- similar then for all "block markers" in array b: the display just shows a line, saying "Block moved/copied here"
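How the two display buffers might be built from the markers can be sketched like this in Python. The data shape is an assumption made for this example only: each block is a tuple (start, end, copies) with 1-based inclusive line numbers, where "copies" is the number of identical blocks found in the other file.

```python
def render_pane_1(lines, blocks):
    """Build display buffer 1a: marked blocks collapse into one summary line."""
    display = []
    for start, end, copies in blocks:
        n = end - start + 1
        if copies == 0:
            display.extend(lines[start - 1:end])   # unmarked block: show as is
        elif copies == 1:
            display.append(f"[Block of {n} lines; has been moved in file 2]")
        else:
            display.append(f"[Block of {n} lines; was moved/copied {copies} times in file 2]")
    return display


def render_pane_2(lines, blocks):
    """Build display buffer 2a: marked blocks collapse into a 'moved here' line."""
    display = []
    for start, end, copies in blocks:
        if copies == 0:
            display.extend(lines[start - 1:end])
        else:
            display.append("[Block moved/copied here]")
    return display


lines1 = ["\u00a5A", "a1", "\u00a5B", "b1", "b2", "\u00a5C", "c1"]
blocks1 = [(1, 2, 0), (3, 5, 2), (6, 7, 1)]
lines2 = ["\u00a5C", "c1", "\u00a5A", "a1 changed", "\u00a5B", "b1", "b2"]
blocks2 = [(1, 2, 1), (3, 4, 0), (5, 7, 1)]

pane1 = render_pane_1(lines1, blocks1)
pane2 = render_pane_2(lines2, blocks2)
print("\n".join(pane1))
print("\n".join(pane2))
```

The remaining, uncollapsed lines of both panes are what the ordinary line-by-line comparison would then work on.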

That's all there is to it: Bear in mind the program will constantly access both arrays (which in reality are more complicated than described above), and thus will "know", e.g., that this specific "Block moved" line somewhere in pane 2 both is line 814 of buffer 2a, and in reality represents lines 2,812 to 2,933 of buffer 2 and actual file 2; and so, if the tool is coded in a smart way, it would even be possible to move that "Block moved" line manually (= by mouse or by arrow key combination) on screen to, let's say, line 745 of screen buffer 2a, and then the program will properly move lines 2,812 to 2,933 of buffer 2, within buffer 2, to right after the line in buffer 2 which corresponds to line 745 in buffer 2a.
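The mapping between a display row and the real lines behind it could look like this in Python. Again just a sketch with invented names: display_map is assumed to hold, for each row of the screen buffer, the 1-based inclusive range of real lines that this row stands for.

```python
def move_block(real_lines, display_map, from_row, after_row):
    """Move the real lines behind display row `from_row` to right after
    the real lines behind display row `after_row` (rows are 1-based)."""
    start, end = display_map[from_row - 1]
    _, target_end = display_map[after_row - 1]
    block = real_lines[start - 1:end]
    rest = real_lines[:start - 1] + real_lines[end:]
    # Position in `rest` after which the block goes; if the target lies
    # behind the block, removing the block has shifted it up by len(block).
    insert_at = target_end if target_end < start else target_end - len(block)
    return rest[:insert_at] + block + rest[insert_at:]


real = ["l1", "l2", "l3", "l4", "l5", "l6"]

# Display rows 1 to 3 show one real line each; row 4 is a collapsed
# "Block moved" line standing for real lines 4 to 6.
display_map = [(1, 1), (2, 2), (3, 3), (4, 6)]

moved = move_block(real, display_map, from_row=4, after_row=1)
print(moved)
```

After such a move, both the display buffer and the map would have to be rebuilt, since the line numbers behind every later row have changed.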

As you can see, there is no real problem whatsoever in implementing my way of "moved-blocks processing" into existing "differs"; it's just a little bit more info to be put into those arrays that are already there anyway, in order for all those "other differences" and their encodings to be stored after checking for them, the bread-and-butter tasks of any "differ" out there.

Hence, again, my question: where's the tremendous difficulty I don't see here, and which hinders developers of tools like BC, etc., from implementing this feature we all crave?
