
SOLVED: Remove duplicate entries from within a single large (text) dictionary


bhuiraj:
I need a small app that can remove duplicate entries from within a single large (anywhere from a couple hundred MB to several GB in size) text file. I have tried using several apps to do this, but they all crash, possibly because they run out of free RAM to use (loading an 800MB text file results in over 1.5GB of RAM being used by the apps). Text files are in the format of one word per line:

donation
coder
<3
cool
cat
Mike

If there is some workaround (such as splitting the file into multiple parts and then comparing each part against the rest to make sure there are no duplicates), that would be great.

Please see https://www.donationcoder.com/forum/index.php?topic=26416.0 for sample dictionary files and let me know if you have any questions. Thank you!
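
A rough sketch of the splitting workaround mentioned above, using standard GNU tools (this assumes Cygwin's split, sort, and bash; the chunk size is arbitrary, and the file names are placeholders):

split -l 5000000 input.txt chunk_
for f in chunk_*; do sort -u -o "$f" "$f"; done   # sort and de-duplicate each chunk in place
sort -m -u chunk_* -o output.txt                  # merge the sorted chunks, dropping duplicates
rm chunk_*

Here sort -m only merges files that are already sorted, so the final step never has to hold more than a line per chunk in memory.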

MilesAhead:
Assuming you are running Windows, I'd look for ports of the Linux utilities. Linux has many tools designed so the output of one can be piped to the input of another, rather than loading everything into RAM.

Searching around, the solutions I found ranged from creating a database to a do-it-yourself merge sort, but the simplest was a Linux command-line solution.

Check for Windows ports of the sort and uniq commands.

edit: the discussion on this thread may give you some ideas:

http://www.cyberciti.biz/faq/unix-linux-shell-removing-duplicate-lines/
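
A minimal sketch of that sort/uniq pipeline, assuming the GNU ports (Cygwin or GnuWin32) and placeholder file names:

sort input.txt | uniq > output.txt
# or equivalently, in a single step:
sort -u input.txt -o output.txt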

bhuiraj:
Check for Windows ports of the sort and uniq commands.

-MilesAhead (April 15, 2011, 05:36 PM)
--- End quote ---
I ended up installing Cygwin, and "uniq -u input.txt output.txt" seemed to work, but since I can't open the large file and check it myself, I decided to make a sample file and test it. Unfortunately, after running "uniq -u input.txt output.txt", it looks like it didn't remove the duplicate entries and instead just copied input.txt to output.txt:

input.txt:
---------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

output.txt:
-----------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

Edit:

"sort -u input.txt output.txt" seems to work, but it also organizes the lines in alphabetical order, which I don't need. Is there any way to have it only show unique lines, but not alphabetize everything?

f0dder:
bhuiraj: uniq only removes adjacent duplicate lines, so it only works on sorted input - hence the need for 'sort' first in the pipe.

Not sure it's the most efficient way to go about the issue, though - if you need to de-duplicate huge amounts of data, it's probably a good idea to look for a specialized tool, since there are various memory/speed tradeoffs to be made. I don't know of any such tool, though, as I haven't had the need myself :)
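
For what it's worth, GNU sort already does an external merge sort: it spills sorted runs to temporary files on disk rather than holding everything in RAM, so it should cope with files larger than memory. A sketch, assuming the GNU coreutils sort that ships with Cygwin (the buffer size and temp directory are only examples):

sort -u -S 512M -T /cygdrive/d/tmp input.txt -o output.txt

The -S flag caps the in-memory buffer and -T points the temporary files at a drive with enough free space.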

MilesAhead:
Most of the solutions I found were memory-based. For example, load everything into a dictionary (associative array, hash, map, etc.) and write it back out. Same problem. I couldn't find a free virtual-memory-backed dictionary. Scripting.Dictionary would do it for small data, but I suspect it would crap out processing a GB file.
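
For smaller files, that associative-array idea can be tried straight from the Cygwin shell with awk - a sketch that also preserves the original line order, though it keeps every unique line in memory, so it would hit the same RAM wall on a multi-GB file:

awk '!seen[$0]++' input.txt > output.txt

The seen[] array counts how many times each line has appeared, and a line is printed only on its first occurrence.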

Others were to use a database. The Linux solution is likely the simplest. Plus it's derived from a system that has done that type of work for a long time. Not Windows.

There should be some free Windows editors that handle files of arbitrary size. Try Softpedia. That way you can look through the file without loading the entire thing into RAM. When you change file position you have to wait for some disk churning is all. :)
