
Author Topic: SOLVED: Remove duplicate entries from within a single large (text) dictionary  (Read 9913 times)

bhuiraj

I need a small app that can remove duplicate entries from within a single large (anywhere from a couple hundred MB to several GB in size) text file. I have tried using several apps to do this, but they all crash, possibly because they run out of free RAM to use (loading an 800MB text file results in over 1.5GB of RAM being used by the apps). Text files are in the format of one word per line:

donation
coder
<3
cool
cat
Mike

If there is some workaround (including splitting files up into multiple parts and then comparing each one to the rest once to make sure there are no duplicates), it would be great.

Please see http://www.donationc...ex.php?topic=26416.0 for sample dictionary files and let me know if you have any questions. Thank you!

MilesAhead

Assuming you are running Windows, I'd look for a port of the Linux utilities. Linux has many tools designed for the output of one to be piped into the input of another, rather than loading everything into RAM.

Searching around, the solutions I found ranged from creating a database to a do-it-yourself merge sort, but the simplest was a Linux command-line solution.

Check for Windows ports of the sort and uniq commands.
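For example, piped together they de-duplicate a word list without an app holding the whole file in RAM (a minimal sketch, assuming one word per line as described above; GNU sort spills to temporary files when the input is larger than memory):

```shell
# sample input: one word per line, with a duplicate
printf 'cat\ndog\ncat\n' > words.txt

# uniq only collapses *adjacent* repeats, so sort must come first
sort words.txt | uniq > deduped.txt

# or, in one step:
sort -u words.txt > deduped.txt
```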

edit: the discussion on this thread may give you some ideas:

http://www.cyberciti...ing-duplicate-lines/

« Last Edit: April 15, 2011, 05:40:15 PM by MilesAhead »

bhuiraj

Quote from: MilesAhead

Assuming you are running Windows, I'd look for a port of the Linux utilities. Linux has many tools designed for the output of one to be piped into the input of another, rather than loading everything into RAM.

Searching around, the solutions I found ranged from creating a database to a do-it-yourself merge sort, but the simplest was a Linux command-line solution.

Check for Windows ports of the sort and uniq commands.

edit: the discussion on this thread may give you some ideas:

http://www.cyberciti...ing-duplicate-lines/


I ended up installing Cygwin, and I thought that "uniq -u input.txt output.txt" seemed to work, but since I can't open the large file to check myself, I decided to make a sample file and test it. Unfortunately, after running "uniq -u input.txt output.txt", it looks like it didn't remove the duplicate entries and instead just copied input.txt to output.txt:

input.txt:
---------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

output.txt:
-----------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

Edit:

"sort -u input.txt output.txt" seems to work, but it also organizes the lines in alphabetical order, which I don't need. Is there any way to have it only show unique lines, but not alphabetize everything?
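For the record, one common answer to this exact question (not mentioned in the thread) is an awk one-liner that keeps the first occurrence of each line in its original order. Note that it holds every unique line in memory, so it has the same RAM limits the original apps hit:

```shell
# sample input with duplicates, one word per line
printf 'asdf\nfdsa\nasdf\nfdsa\nkjdshiufahysiufy\n' > input.txt

# print a line only the first time it is seen; preserves input order,
# but the seen[] array grows with the number of unique lines
awk '!seen[$0]++' input.txt > output.txt
```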
« Last Edit: April 15, 2011, 06:42:18 PM by bhuiraj »

f0dder

bhuiraj: uniq only works on sorted files, thus you need 'sort' first in the pipe.

Not sure it's the most efficient way to go about the issue, though - if you need to de-duplicate huge amounts of data, it's probably a good idea to look for a specialized tool, since there are various memory/speed tradeoffs to be made. I don't know of any such tool, though, as I haven't had the need myself :)
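A quick demonstration of the point, using the same sample data as above:

```shell
# uniq alone leaves non-adjacent duplicates in place:
printf 'asdf\nfdsa\nasdf\n' | uniq          # all three lines survive

# sorting first makes the duplicates adjacent, so uniq can drop them:
printf 'asdf\nfdsa\nasdf\n' | sort | uniq   # prints: asdf, fdsa
```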
- carpe noctem

MilesAhead

Most of the solutions I found were memory based. For example, load into a dictionary (associative array, hash, map, etc.) and write it back out. Same problem. I couldn't find a free virtual-memory-backed dictionary. Scripting.Dictionary would do it for small data, but I suspect it would crap out processing a GB file.

Others suggested using a database. The Linux solution is likely the simplest. Plus, it's derived from a system that's done that type of stuff for a long time. Not Windows.

There should be some free Windows editors that handle files of arbitrary length. Try Softpedia. That way you can look through the file without loading the entire file into RAM. When you change file position you have to wait for some disk churning is all. :)

« Last Edit: April 15, 2011, 08:07:03 PM by MilesAhead »

bhuiraj

sort -u input.txt >output.txt (in Cygwin) is very fast and works great. Thank you for pointing me towards the ported *nix apps.

@Moderator: Please close this thread.

MilesAhead

Quote from: bhuiraj

sort -u input.txt >output.txt (in Cygwin) is very fast and works great. Thank you for pointing me towards the ported *nix apps.

@Moderator: Please close this thread.

Glad it worked for you. :)

bhuiraj

In case anyone was wondering, it would take well in excess of a week to sort a 33GB dictionary. I started sorting my 33GB file on April 15th and finally cancelled it today (after 9 days), not even half done.

f0dder

Quote from: bhuiraj

In case anyone was wondering, it would take well in excess of a week to sort a 33GB dictionary. I started sorting my 33GB file on April 15th and finally cancelled it today (after 9 days), not even half done.
There must be some more efficient software out there.
- carpe noctem

MilesAhead

Out of curiosity, could you see the methodology? I would think something that big would have to use some type of merge sort. Especially if you only have one disk, that would be thrash city.

bhuiraj

Quote from: MilesAhead

Out of curiosity, could you see the methodology? I would think something that big would have to use some type of merge sort. Especially if you only have one disk, that would be thrash city.

In cygwin's tmp folder, I noticed that the sort process created several smaller temporary files that were merged together in stages. My thought was that it splits the large dictionary file up into many smaller files of approximately the same size (some dictionaries were split up into multiple 64MB files), compares them, merges some of them together, compares again, and so forth until it ends up with two files that are half the size of the complete unduped file. Then, those two files are merged together to create the final dictionary with the duplicates removed.
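That observation matches a classic external merge sort. Done by hand with the same Cygwin tools, the idea looks roughly like this (a toy-sized sketch; GNU sort manages the chunking and the temporary files internally):

```shell
# make a file with duplicate lines spread across it
printf 'pear\napple\npear\nfig\napple\nfig\n' > big.txt

# 1. split the big file into chunks small enough to sort in RAM
#    (2 lines each here; GNU sort's real chunks were ~64MB)
split -l 2 big.txt chunk_

# 2. sort each chunk individually, in place
for f in chunk_*; do sort -o "$f" "$f"; done

# 3. merge the pre-sorted chunks (-m) and drop duplicates (-u)
sort -m -u chunk_* > deduped.txt
rm chunk_*
```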

MilesAhead

Thanks for the detailed reply.  :Thmbsup: