Author Topic: SOLVED: Remove duplicate entries from within a single large (text) dictionary  (Read 15834 times)

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
I need a small app that can remove duplicate entries from within a single large text file (anywhere from a couple hundred MB to several GB in size). I have tried several apps to do this, but they all crash, possibly because they run out of free RAM (loading an 800MB text file results in the apps using over 1.5GB of RAM). The text files contain one word per line:

donation
coder
<3
cool
cat
Mike

If there is some workaround (such as splitting the file into multiple parts and then comparing each part against the others to make sure no duplicates remain across them), that would be great.
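
For illustration, here is a rough sketch of that splitting workaround, assuming a GNU coreutils port (such as Cygwin or GnuWin32) is available; words.txt and the chunk_/sorted_ names are just placeholders:

# Split the big word list into chunks of one million lines each
split -l 1000000 words.txt chunk_

# Sort and de-duplicate each chunk on its own (small enough to fit in RAM)
for f in chunk_*; do
    sort -u "$f" -o "sorted_$f"
done

# Merge the already-sorted chunks, dropping duplicates across chunks
sort -m -u sorted_chunk_* -o deduped.txt

GNU sort itself uses this kind of split-sort-merge approach internally for files that do not fit in memory, so a single sort -u over the whole file has the same effect.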

Please see https://www.donation...ex.php?topic=26416.0 for sample dictionary files and let me know if you have any questions. Thank you!

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Assuming you are running Windows, I'd look for ports of the Linux utilities. Linux has many tools designed so that the output of one can be piped to the input of another, rather than loading everything into RAM.

Searching around, the solutions I found ranged from creating a database to a do-it-yourself merge sort, but the simplest was a Linux command-line solution.

Check for Windows ports of the sort and uniq commands.
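
As a minimal example of what that looks like once the ports are installed (input.txt and output.txt are placeholder names), either of the following should produce a de-duplicated word list:

# Classic pipeline: sort groups identical lines together, uniq drops the repeats
sort input.txt | uniq > output.txt

# Or let sort do both steps in one pass
sort -u input.txt -o output.txt

Since sort reads all of its input before writing anything, the -o form can even write the result back over the input file.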

Edit: the discussion in this thread may give you some ideas:

http://www.cyberciti...ing-duplicate-lines/

« Last Edit: April 15, 2011, 05:40 PM by MilesAhead »

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
I ended up installing Cygwin, and I thought "uniq -u input.txt output.txt" seemed to work, but since I can't open the large file to check it myself, I decided to make a sample file and test it. Unfortunately, after running "uniq -u input.txt output.txt", it looks like it didn't remove the duplicate entries and instead just copied input.txt to output.txt:

input.txt:
---------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

output.txt:
-----------
asdf
fdsa
asdf
fdsa
kjdshiufahysiufy

Edit:

"sort -u input.txt output.txt" seems to work, but it also organizes the lines in alphabetical order, which I don't need. Is there any way to have it only show unique lines, but not alphabetize everything?
« Last Edit: April 15, 2011, 06:42 PM by bhuiraj »

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
bhuiraj: uniq only removes adjacent duplicate lines, so it only works on sorted input; you need 'sort' first in the pipe.
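
Using the sample file from the post above, the difference looks roughly like this (expected output shown in the comments):

uniq -u input.txt          # no adjacent repeats, so all five lines come back unchanged

sort input.txt | uniq      # sorting first groups the repeats; prints:
                           #   asdf
                           #   fdsa
                           #   kjdshiufahysiufy

sort input.txt | uniq -u   # -u keeps only lines that never repeat,
                           #   so this prints just kjdshiufahysiufy

Note that plain uniq (without -u) is the variant that keeps one copy of each line; -u drops every line that has a duplicate, which is usually not what you want when building a dictionary.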

Not sure it's the most efficient way to go about the issue, though - if you need to uniqueify huge amounts of data, it's probably a good idea to look for a specialized tool, since there are various memory/speed tradeoffs to be made. I don't know of any such tool, though, as I haven't had the need myself :)
- carpe noctem

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Most of the solutions I found were memory based: for example, load everything into a dictionary (associative array, hash, map, etc.) and write it back out. Same problem. I couldn't find a free virtual-memory-backed dictionary. Scripting.Dictionary would do it for small data, but I suspect it would crap out processing a GB file.
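
For what it's worth, if awk is available (gawk can be installed through Cygwin), the one-liner below is a compact version of that dictionary approach; it keeps the first occurrence of each line and preserves the original order, but memory use still grows with the number of distinct lines, so it shares the same limitation on multi-GB inputs:

# seen[] is an in-memory associative array; a line is printed only the first time it appears
awk '!seen[$0]++' input.txt > output.txt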

Others were to use a database. The Linux solution is likely the simplest; plus, it's derived from a system that has been doing that type of stuff for a long time, unlike Windows.

There should be some free Windows editors that handle files of arbitrary size. Try Softpedia. That way you can look through the file without loading the whole thing into RAM; when you change file position you just have to wait for some disk churning. :)

« Last Edit: April 15, 2011, 08:07 PM by MilesAhead »

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
"sort -u input.txt >output.txt" (in Cygwin) is very fast and works great. Thank you for pointing me towards the ported *nix apps.

@Moderator: Please close this thread.

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Glad it worked for you. :)

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
In case anyone was wondering, it would take well in excess of a week to sort a 33GB dictionary. I started sorting my 33GB file on April 15th and finally cancelled it today (after 9 days), when it was not even half done.

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
There must be some more efficient software out there.
- carpe noctem
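
For anyone retrying this on a file that size, a few standard GNU sort options usually make a big difference; availability of some flags depends on the coreutils version Cygwin ships, and the sizes and paths below are only examples:

# Byte-wise comparison instead of locale-aware collation (often a large speedup)
LC_ALL=C sort -u big.txt -o big_deduped.txt

# Give sort a larger in-memory buffer and put its temp files on a different disk
LC_ALL=C sort -u -S 2G -T /cygdrive/d/tmp big.txt -o big_deduped.txt

# Newer coreutils builds (8.6+) can also sort chunks on several cores
LC_ALL=C sort -u --parallel=4 -S 2G big.txt -o big_deduped.txt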

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Out of curiosity, could you see the methodology? I would think something that big would have to use some type of merge sort. Especially if you only have one disk, that would be thrash city.

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
In Cygwin's tmp folder, I noticed that the sort process created several smaller temporary files that were merged together in stages. My guess is that it splits the large dictionary file into many smaller files of approximately the same size (some dictionaries were split into multiple 64MB files), compares them, merges some of them together, compares again, and so on until it ends up with two files that are each half the size of the complete de-duplicated file. Then those two files are merged together to create the final dictionary with the duplicates removed.

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Thanks for the detailed reply.  :Thmbsup: