Topic: IDEA: Compare dictionary (text) files and remove duplicate entries  (Read 30703 times)

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
IT security is a hobby of mine, and one aspect of it requires building large (from hundreds of megabytes to several gigabytes) dictionary (text) files in the format of one word per line. For example:

donation
coder
<3
cat
dog
Mary

Currently, my collection includes several gigabytes of dictionaries spread out over dozens of text files (the largest being over 30GB, while most are between a few MB and 1.5GB). The problem is that I waste days and even weeks of processing time because of all the duplicate entries between files. So, I need a program that will let me select a master list and multiple additional lists and do either of the following (both functions would be ideal, but it's entirely up to whatever you are comfortable doing and consider a coding snack; if you can code an app that does all of these functions and works with large text files, I can make a small donation to you as a thank you):
1) compare the master list to the additional lists and remove the duplicate entries from the additional lists
2) compare the additional lists to the master list and remove the duplicate entries from the master list (the opposite of #1)
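
For illustration, here is a minimal sketch of function #2 in Python. It assumes both lists have already been sorted and de-duplicated with the same collation (e.g. "LC_ALL=C sort -u"), so the two files can be streamed in one merge pass without loading anything big into memory; the filenames are hypothetical.

# subtract_sorted.py -- rough sketch, not a finished tool.
# Writes every line of the first list that does NOT appear in the second list.
# Both inputs must already be sorted the same way (e.g. "LC_ALL=C sort -u"),
# so a single streaming merge pass is enough.
#
# Hypothetical usage:
#   python subtract_sorted.py master.txt newlist.txt master_minus_newlist.txt

import sys

def subtract(keep_path, remove_path, out_path):
    with open(keep_path, "r", encoding="utf-8", errors="replace") as keep, \
         open(remove_path, "r", encoding="utf-8", errors="replace") as remove, \
         open(out_path, "w", encoding="utf-8") as out:
        k = keep.readline()
        r = remove.readline()
        while k:
            if not r or k.rstrip("\r\n") < r.rstrip("\r\n"):
                out.write(k)               # word is only in the "keep" list
                k = keep.readline()
            elif k.rstrip("\r\n") == r.rstrip("\r\n"):
                k = keep.readline()        # duplicate: drop it
            else:
                r = remove.readline()      # the other list is behind: advance it

if __name__ == "__main__":
    subtract(sys.argv[1], sys.argv[2], sys.argv[3])

Swapping the first two arguments gives function #1, i.e. stripping an additional list of anything already in the master.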

I have tried a couple of public programs, but they crash or error out when I try to use multiple files or files that are over a couple of hundred megabytes. I came across posts suggesting that it might be because of RAM limitations, but I don't think that's the case, since I have over 3GB of RAM.

I hope that someone will be able to do this, and I apologize if I have not given enough detail. Please let me know if you have any questions. Thank you!
« Last Edit: April 17, 2011, 06:38 PM by bhuiraj, Reason: removing a required function »

skwire

  • Global Moderator
  • Joined in 2005
  • Posts: 5,287
Can you rar/7zip some sample files and make them available?  If you don't have any webspace, I can PM you details to FTP them directly to my server.

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
Sure :) I keep 7zipped copies of most of them since they can be so large when uncompressed. I can upload a couple files, but wouldn't be able to upload the really big lists because of my connection speed. It would be faster for me to give you the links to some sample dictionaries:
http://java.sun.com/...mples/dictionary.txt
ftp://ftp.openwall.com/pub/wordlists/all.gz
http://dl.free.fr/hm...theargonlistver2.zip (click on the small text link "Télécharger ce fichier" to download)

But, let me know if you need me to upload some other lists I have or if you have any questions. Thanks :)

« Last Edit: April 14, 2011, 02:16 PM by bhuiraj, Reason: Adding more sample dictionary links »

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,205
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here.  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicate lines in one pass, which would be ideal for you.

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here.  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicate lines in one pass, which would be ideal for you.
Looks like a great app, but I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,205
I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."

Hmmm...  it does have such a limit.  I wasn't expecting that, as I'd assumed your "one word to a line" files would never have lines that long.  32750 bytes is a long word!  Maybe RPSORT is throwing up a false message because that's the closest it can get to telling you about the real problem.
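
One quick way to test that theory is to scan the file and report its longest line; if nothing comes anywhere near 32750 bytes, the message really is misleading (perhaps the file has unusual line endings, or no newlines at all). A rough Python sketch, with a hypothetical filename:

# longest_line.py -- report the longest line in a wordlist.
# Streams the file, so multi-gigabyte inputs are fine.
# Usage: python longest_line.py wordlist.txt

import sys

longest = 0
count = 0
with open(sys.argv[1], "rb") as f:        # binary mode: no decoding surprises
    for line in f:
        count += 1
        longest = max(longest, len(line.rstrip(b"\r\n")))

print("lines: %d, longest line: %d bytes" % (count, longest))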

In that case, another possibility is the Freeware Command Line Sort Utility CMSort by Christian Maas.  It's a 32-bit command-line sort utility for Windows 95/98/NT/2000/XP/Vista/7.  I used older versions years ago, but settled on RPSORT because, where you could use either, RPSORT was very much faster, and much more feature-rich to boot.  You'd probably have to make more than one pass with CMSort, using different switches.  However, it does work, and it is the only 32-bit console-mode sort tool I can think of at present.  GNU has sort and uniq tools, but my experience with an (also old) Windows port of GNU sort wasn't a happy one.

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
It looks like CMSort is for removing duplicates from within a single file (something I am doing with Cygwin and "sort -u") and can't compare two or more files and remove the duplicates from one of them.

rjbull

  • Charter Member
  • Joined in 2005
  • Posts: 3,205
I read the doc to mean CMSort can remove duplicates, but it needs more work - quoting CMSort's readme:
III. Example 2
==============

This example shows how to ignore duplicate records. Duplicate records
are recognized by the defined key, not by the whole line. If you want
to exclude duplicate lines, you must perform an additional sort
beforehand by using the whole line as the key.

You could combine multiple files before sorting by concatenation, e.g.

copy file_1+file_2+file_3 bigfile

or use a Unix "cat" utility.  But if you're already (erm) "sorted" with GNU sort, fine.  I probably had a DOS port of it, rather than a Windows console port, and I'd imagine the latter should work better, especially on your large files.
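
If neither copy nor cat is convenient, the same concatenation can be done in a few lines of Python; it copies in blocks, so file size is not an issue (filenames are hypothetical):

# concat.py -- concatenate several wordlists into one file, block by block.
# Usage: python concat.py combined.txt file_1.txt file_2.txt file_3.txt

import shutil
import sys

with open(sys.argv[1], "wb") as out:
    for name in sys.argv[2:]:
        with open(name, "rb") as src:
            shutil.copyfileobj(src, out)   # streams in chunks; never loads a whole file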

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't that result in just having a master list?  Since the master has everything and all the sublists are stripped of entries that are in the master, wouldn't all the subsidiary lists eventually be empty?

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #10 on: April 17, 2011, 06:37 PM »
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't that result in just having a master list?  Since the master has everything and all the sublists are stripped of entries that are in the master, wouldn't all the subsidiary lists eventually be empty?


Ending up with a single 40 or 50GB dictionary would be unwieldy, to say the least, especially since I am constantly updating the dictionary. As an example of why I want to be able to compare lists: if I have already tested with a 5GB dictionary and I find another dictionary that I also want to use against the target, I can compare the new list to the original 5GB list and remove the duplicate entries (dictionary entries that were already tested in the 5GB list) so that I am not retesting those same entries when I test with the new list. I have dozens of dictionaries and there are millions of duplicate entries between them, which is not at all efficient.

Right now, I am removing duplicates from within a 30+GB dictionary (using "sort -u") and it has been running for almost 36 hours already. So, it is not practical to merge every new dictionary with an existing large dictionary and then remove duplicate entries within that file.
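
As an aside, the standard trick when a file is far too big to sort in RAM is an external merge sort: sort and de-duplicate chunks that do fit in memory, then merge the sorted chunks while skipping repeats. GNU sort already works this way internally (and giving it a bigger buffer with -S, a fast temp directory with -T, and LC_ALL=C usually speeds it up considerably), but here is a rough Python sketch of the idea, with hypothetical filenames and chunk size:

# external_dedupe.py -- rough sketch of de-duplicating a file larger than RAM.
# Phase 1: read the input in chunks, sort each chunk, write it to a temp file.
# Phase 2: merge the sorted temp files with heapq.merge, dropping repeats.
# Hypothetical usage: python external_dedupe.py big.txt big_unique.txt

import heapq
import os
import sys
import tempfile

CHUNK_LINES = 5_000_000            # tune to your RAM; ~5 million short words per chunk

def dump(chunk):
    """Sort one in-memory chunk and write it to a temporary file."""
    chunk.sort()
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.writelines(chunk)
    return path

def sorted_chunks(path):
    chunk, paths = [], []
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.endswith("\n"):
                line += "\n"       # normalise a possible final line with no newline
            chunk.append(line)
            if len(chunk) >= CHUNK_LINES:
                paths.append(dump(chunk))
                chunk = []
    if chunk:
        paths.append(dump(chunk))
    return paths

def merge_unique(paths, out_path):
    readers = [open(p, "r", encoding="utf-8") for p in paths]
    with open(out_path, "w", encoding="utf-8") as out:
        previous = None
        for line in heapq.merge(*readers):   # streams; never holds everything at once
            if line != previous:
                out.write(line)
                previous = line
    for r in readers:
        r.close()
    for p in paths:
        os.unlink(p)

if __name__ == "__main__":
    merge_unique(sorted_chunks(sys.argv[1]), sys.argv[2])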

Also, you can ignore function #1 (I started using "copy *.txt output.txt" a few days ago and will remove that feature from my post now).

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #11 on: April 17, 2011, 07:34 PM »
Sounds like you need to incorporate some type of database engine; efficient access to large sets of keys is exactly what they are built for.  How to access it from ad hoc outside apps is another matter.  But giant flat text files seem to have reached their practical limit for your purposes.

I would search around for a forum where people deal exclusively or mainly with db application issues.  Once you have an idea of workable db storage, you may be able to describe a small glue app that bridges the gap between the db and the apps that want to use it.

The thing is, the way you describe it, it sounds like the key "indexing" or sorting would have to be done repeatedly.  I don't know if even a free db would do much good.  It seems too scattered to lend itself to streamlining.
« Last Edit: April 17, 2011, 07:40 PM by MilesAhead »

Renegade

  • Charter Member
  • Joined in 2005
  • Posts: 13,291
  • Tell me something you don't know...
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #12 on: April 17, 2011, 07:41 PM »
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.

Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #13 on: April 17, 2011, 07:55 PM »
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.


I was always under the impression that dictionary = wordlist. Anyway, yes, I use them for auditing the passwords of friends and family. Unfortunately, a lot of people (including my brother and parents) don't understand the importance of using a *unique* and *secure* password until I demonstrate to them how quickly their static password of choice ("password", "john1970", and so forth) can be compromised (it took ~3 seconds for my mom's and about 5 minutes for my dad's using a 700MB dictionary).

Also, it doesn't look like the entire 30+GB file is duplicates (though there are a lot of junk entries).

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #14 on: April 17, 2011, 08:05 PM »
It's an interesting problem, because most DBs are designed around keys that look up some associated data.  In this case, the "keys" are the data.

Instead of sorting a giant file, you might sort a smaller "seed" file and then find some routine that adds to sorted flat files via binary search.  The routine would do a binary search and, if the word is not found, insert it in sorted position, etc.
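
In miniature, and only for as long as the seed list still fits in memory, that routine might look like the following Python sketch (a true on-disk version would need fixed-length records or an index to binary-search the file itself):

# seed_insert.py -- the "binary search, insert if missing" idea in miniature.
# Works on an in-memory copy of a sorted seed list, so it only applies while
# that list still fits in RAM.

import bisect

def add_word(sorted_words, word):
    """Insert word into the sorted list unless it is already present."""
    i = bisect.bisect_left(sorted_words, word)     # O(log n) search
    if i == len(sorted_words) or sorted_words[i] != word:
        sorted_words.insert(i, word)               # O(n) insert, keeps the list sorted
        return True
    return False

# Hypothetical usage:
seed = ["abacus", "cat", "dog"]                    # already sorted
for w in ("cat", "badger", "zebra"):
    add_word(seed, w)
print(seed)                                        # ['abacus', 'badger', 'cat', 'dog', 'zebra']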

Still, large files on slow disks are a problem.  Growing a small file until it reaches some optimum size may be an approach.  At some point as you say the text file becomes unwieldy.

If anyone has already done it though, it's probably some db guru.

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #16 on: April 17, 2011, 09:17 PM »
Just for chuckles, I started a thread on the W7 forum.  See if anyone adds a useful suggestion there:

http://www.sevenforu...lem.html#post1353124

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #17 on: April 17, 2011, 11:23 PM »
Thank you for all the help, guys :). I will continue to look for a solution and follow the w7 thread. Please continue posting if you have any thoughts or new ideas.

DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #18 on: April 30, 2011, 11:37 AM »
Well, there is no problem having a DB with the keyword as the index key; the same system is the basis of a search engine.
So I think you can find some DB software (I suggest something SQLite-based, since SQLite is a very fast and small DB; for example, SQLite Database Browser) and start inserting the keywords.
And if you create the keyword field as unique, you cannot insert duplicates.
Of course, if you create an index on the keyword to make searches faster, that index will take up additional disk space.
I think this is the most efficient system.
Otherwise, you can stick with your current system, maybe splitting the words into several files named after the starting letter of the keyword (A.txt for all the words starting with A, B.txt for words starting with B, etc.).
And if you think it could be useful, I already have a little tool to compare and merge two text files with no duplicates; of course, I haven't tried it on very BIG text files.
However, I strongly suggest the first solution, which is more practical, I think.
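
As a concrete illustration of that first solution, here is a minimal sketch using Python's built-in sqlite3 module; the table, column, and file names are just examples. The UNIQUE constraint makes the database refuse duplicate words, and INSERT OR IGNORE silently skips them, so importing a new dictionary keeps only previously unseen entries.

# wordlist_db.py -- sketch of the SQLite approach (names are examples only).

import sqlite3

db = sqlite3.connect("wordlist.db")
db.execute("CREATE TABLE IF NOT EXISTS wordlist (word TEXT UNIQUE)")

def import_file(path):
    """Add every word in a text file, silently skipping ones already stored."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        with db:                                   # one transaction for the whole file
            db.executemany(
                "INSERT OR IGNORE INTO wordlist (word) VALUES (?)",
                ((line.rstrip("\r\n"),) for line in f),
            )

def contains(word):
    cur = db.execute("SELECT 1 FROM wordlist WHERE word = ?", (word,))
    return cur.fetchone() is not None

if __name__ == "__main__":
    import_file("newlist.txt")                     # hypothetical filename
    print(contains("password"))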

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Re: IDEA: Compare dictionary (text) files and remove duplicate entries
« Reply #19 on: April 30, 2011, 05:58 PM »
The trouble with using a relational database in this case is that the key is the only data.  You're not saving anything or creating any efficiency.  If I have three paragraphs of text stored under the key "ProfIrwinCoreySpeach", then I can use the key to get the data.  With the dictionary, there is no other data.  There's nothing to optimize.

You could split the files according to first character, as you say.  But the only good way would be to build as you go along; having 33GB of spaghetti data to unravel at the start makes it unwieldy.  You need an engine that builds the files from scratch as you feed it words.  Even then, once the letter files get too big to hold in RAM, it's going to be slow, because you'll have to either create space for insertions found by binary search or use some fixed-size average word length, which will waste space.
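
For what the splitting step itself is worth, it is cheap to do in a single streaming pass; a rough Python sketch (output directory and bucket naming are arbitrary choices):

# split_by_letter.py -- stream a big wordlist into per-first-character bucket
# files (a.txt ... z.txt, plus other.txt), so later sorting and de-duplicating
# can work on much smaller pieces.
# Usage: python split_by_letter.py big.txt

import os
import string
import sys

def split(path, out_dir="buckets"):
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            for line in f:
                first = line[:1].lower()
                name = first if first in string.ascii_lowercase else "other"
                if name not in handles:
                    handles[name] = open(os.path.join(out_dir, name + ".txt"),
                                         "a", encoding="utf-8")
                handles[name].write(line)
    finally:
        for h in handles.values():
            h.close()

if __name__ == "__main__":
    split(sys.argv[1])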

I think it's the case with no easy solution.
« Last Edit: April 30, 2011, 05:59 PM by MilesAhead »

DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
The trouble with using a relational database in this case is that the key is the only data.  You're not saving anything or creating any efficiency.  If I have three paragraphs of text stored under the key "ProfIrwinCoreySpeach", then I can use the key to get the data.  With the dictionary, there is no other data.  There's nothing to optimize.
What do you mean?
According to your example, you have this huge wordlist:
ab
aba
abaca
abacas
abaci
aback
abacus
abacuses
...
etc. etc.

that can be inserted into a database by creating a table with only one field, something like this:

CREATE TABLE wordlist (
    word VARCHAR UNIQUE   -- UNIQUE makes the database reject duplicate words on insert
);

That's all.
Or maybe I have not understood your problem well  :(

You could split the files according to first character, as you say.  But the only good way would be to build as you go along; having 33GB of spaghetti data to unravel at the start makes it unwieldy.
Of course, once you have the file well organized, the next time you add a new keyword you must insert it in the correct place. Or, you can have some background process sort the file for binary search. Or, you can create an index file (to make searches fast) over the unordered data.
« Last Edit: May 01, 2011, 05:39 AM by DK2IT »

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
OK, since you have the solution, please provide it to the OP.
I'd be interested in the performance of a DB that's both keys-only and many times larger than physical RAM.

Edit: in any case, it seems a chore to try to massage the data in one go.  Probably a better approach is to grab a chunk and sort it, then add entries as a background task.  How to go about it may depend on the tools the OP is currently using.  Perhaps the big word list can be built while still being usable by the front-end tools.

I haven't been able to get any hits searching on this type of problem.  Maybe it's not one RDBMS companies like to mention, as I imagine it's a worst-case scenario: all keys, nothing else to look up.
« Last Edit: May 01, 2011, 04:43 PM by MilesAhead »

DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
Well, I've made a test.
Maybe SQLite is not so optimized for file size, but it's very fast at inserting and finding words.
The file size is about two times that of the TXT, and using an index to make searches faster, the size is four times!
However, with today's big drives, and on an NTFS volume using file compression, the file size should not be a problem.
Here's my test: an SQLite file with about 5.6 million words - maybe there are duplicates, but I made a very quick import.
Searching is very quick, even using the "slow" LIKE.

bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
How would I use/apply this? :)

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
I don't see what you suggested that I didn't already suggest in this post:

https://www.donation....msg245865#msg245865