IDEA: Compare dictionary (text) files and remove duplicate entries


DonationCoder.com Software > Post New Requests Here


bhuiraj:
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles. It's currently a 16-bit DOS program; details here. Even so, it might be worth trying, as it's very fast and powerful. You can combine files and remove duplicates in line, which would be ideal for you.
-rjbull (April 14, 2011, 04:15 PM)
--- End quote ---
Looks like a great app, but I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."

rjbull:
I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."
-bhuiraj (April 14, 2011, 06:42 PM)
--- End quote ---

Hmmm... it does have such a limit. I wasn't expecting that, as I'd assumed your "one word to a line" files would never have lines that long. 32750 bytes is a long word! Maybe RPSORT is throwing up a false message because that's the closest it can get to telling you about the real problem.
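One way to check whether the message is genuine is to scan the file for lines longer than the limit. The following is a minimal sketch, not anything RPSORT itself provides: the function name `find_long_lines` is made up for illustration, and the default of 32750 is simply the figure from the error message. It splits on LF only, so a tool that expects a different line ending may see the line boundaries differently.

```python
def find_long_lines(path, limit=32750):
    """Return (line_number, byte_length) for every line longer than `limit` bytes.

    Hypothetical helper: the file is read in binary mode and split on LF,
    and a trailing CR/LF is not counted towards the length.
    """
    hits = []
    with open(path, "rb") as fh:
        for number, raw in enumerate(fh, start=1):
            length = len(raw.rstrip(b"\r\n"))
            if length > limit:
                hits.append((number, length))
    return hits
```

An empty result would suggest that no line really exceeds the limit when the file is split on LF.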

In that case, another possibility is CMsort, a freeware command line sort utility by Christian Maas. It's a 32-bit command line sort utility for Windows 95/98/NT/2000/XP/Vista/7. I used older versions years ago, but settled on RPSORT because, where you could use either, RPSORT was very much faster, and much more feature-rich to boot. You'd probably have to make more than one pass with CMsort using different switches. However, it does work, and is the only 32-bit console mode sort tool I can think of at present. GNU has sort and uniq tools, but my experience of an (also old) Windows port of GNU sort wasn't too happy.

bhuiraj:
It looks like CMsort only removes duplicates within a single file (something I am already doing with Cygwin and "sort -u"). It can't compare two or more files and remove from one of them the entries that also appear in another.
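The cross-file step described here (drop from one list every entry that already exists in another) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `remove_known_entries` and the file arguments are hypothetical, entries are compared as exact byte strings (case-sensitive, one entry per line), and the whole master list has to fit in memory as a set.

```python
def remove_known_entries(master_path, list_path, out_path):
    """Copy list_path to out_path, skipping lines that also occur in master_path.

    Hypothetical helper. Lines are compared byte for byte after stripping the
    trailing CR/LF. Returns the number of lines written.
    """
    # Load every master entry into a set for fast membership tests.
    with open(master_path, "rb") as fh:
        known = {raw.rstrip(b"\r\n") for raw in fh}

    kept = 0
    with open(list_path, "rb") as src, open(out_path, "wb") as dst:
        for raw in src:
            word = raw.rstrip(b"\r\n")
            if word not in known:
                dst.write(word + b"\n")
                kept += 1
    return kept
```

The secondary list is streamed rather than loaded, so only the master list's size is limited by memory. Duplicates inside the secondary list itself are left alone, since "sort -u" already covers that case.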

rjbull:
I read the doc to mean CMsort can remove duplicates, but needs more work - quoting CMsort's readme:
III. Example 2
==============

This example shows how to ignore duplicate records. Duplicate records
are recognized by the defined key, not by the whole line. If you want
to exclude duplicate lines, you must perfom an additional sort
beforehead by using the whole line as key.
--- End quote ---

You could combine multiple files before sorting by concatenation, e.g.

copy /b file_1+file_2+file_3 bigfile

or use a Unix "cat" utility. But if you're already (erm) "sorted" with GNU sort, fine. I probably had a DOS port of it, rather than a Windows console port, and I'd imagine the latter should work better, especially on your large files.
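For comparison, the "concatenate, then sort and drop duplicates" approach can also be done in one pass in Python. This is a rough sketch rather than a replacement for any of the tools mentioned: `merge_unique` is a hypothetical name, ordering is plain byte order rather than any locale-aware collation, and the set of unique lines must fit in memory.

```python
def merge_unique(input_paths, out_path):
    """Merge several word lists into one sorted file without duplicate lines.

    Hypothetical helper. Lines are compared byte for byte after stripping the
    trailing CR/LF, and written back with LF endings in byte order.
    Returns the number of unique lines written.
    """
    unique = set()
    for path in input_paths:
        with open(path, "rb") as fh:
            for raw in fh:
                unique.add(raw.rstrip(b"\r\n"))

    with open(out_path, "wb") as out:
        for word in sorted(unique):
            out.write(word + b"\n")
    return len(unique)
```

A call such as `merge_unique(["file_1", "file_2", "file_3"], "bigfile")` would stand in for the concatenation plus a whole-line unique sort.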

MilesAhead:
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't you end up with just the master list? Since the master has everything, and every sublist is stripped of entries found in the master, wouldn't all the subsidiary lists eventually be empty?
