DonationCoder.com Forum

DonationCoder.com Software => Coding Snacks => Post New Requests Here => Topic started by: bhuiraj on April 14, 2011, 01:14 PM

Title: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 14, 2011, 01:14 PM
IT security is a hobby of mine and one aspect requires building large (from hundreds of megabytes to several gigabytes) dictionary (text) files in the format of one word on each line. For example:

donation
coder
<3
cat
dog
Mary

Currently, my collection includes several gigabytes of dictionaries spread out over dozens of text files (the largest being over 30GB, while most are between a few MB and 1.5GB). The problem is that I waste days and even weeks of processing time because of all the duplicate entries between files. So, I need a program that will let me select a master list and multiple additional lists and do any of the following (all functions would be ideal, but it's entirely up to whatever you are comfortable doing and consider a coding snack; if you can code an app to do all of these functions and they work with large text files, I can make a small donation to you as a thank you):
1) compare the master list to the additional lists and remove the duplicate entries from the additional lists
2) compare the additional lists to the master list and remove the duplicate entries from the master list (the opposite of #1)

I have tried a couple public programs, but they crash/error out when I try to use multiple files or files that are over a couple hundred megabytes. I came across posts suggesting that it might be because of RAM limitations, but I don't think that this is the case, since I have over 3GB of RAM.

I hope that someone will be able to do this and apologize if I have not given enough detail. Please let me know if you have any questions about this. Thank you!
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: skwire on April 14, 2011, 01:46 PM
Can you rar/7zip some sample files and make them available?  If you don't have any webspace, I can PM you details to FTP them directly to my server.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 14, 2011, 01:58 PM
Sure :) I keep 7zipped copies of most of them since they can be so large when uncompressed. I can upload a couple files, but wouldn't be able to upload the really big lists because of my connection speed. It would be faster for me to give you the links to some sample dictionaries:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
ftp://ftp.openwall.com/pub/wordlists/all.gz
http://dl.free.fr/hmUZYo0GE/theargonlistver2.zip (click on the small text link "Télécharger ce fichier" to download)

But, let me know if you need me to upload some other lists I have or if you have any questions. Thanks :)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 14, 2011, 02:01 PM
There are also more dictionaries here:
http://wordlist.sourceforge.net/
ftp://nic.funet.fi/pub/unix/security/dictionaries/DEC-collection/
http://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/
http://www.wuala.com/Lifehacker%20Fun%20File%20Swap/Documents/Huge%20Word%20List.txt?lang=en
http://www.megaupload.com/?d=3YQT6SRB (a big, very compressible dictionary)
http://mirror.sweon.net/madchat/crypto/wordlists/
http://www.mirrorservice.org/sites/ftp.wiretapped.net/pub/security/info/reference/wordlists/
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: rjbull on April 14, 2011, 04:15 PM
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here (http://reimagery.com/fsfd/txtutil2.htm).  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicates in line, which would be ideal for you.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 14, 2011, 06:42 PM
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here (http://reimagery.com/fsfd/txtutil2.htm).  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicates in line, which would be ideal for you.
Looks like a great app, but I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: rjbull on April 17, 2011, 11:00 AM
I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."

Hmmm...  it does have such a limit.  I wasn't expecting that, as I'd assumed your "one word to a line" files would never have lines that long.  32750 bytes is a long word!  Maybe RPSORT is throwing up a false message because that's the closest it can get to telling you about the real problem.

In that case, another possibility is Freeware Command Line Sort Utility CMSort (http://www.chmaas.handshake.de/delphi/freeware/cmsort/cmsort.htm) by Christian Maas.  It's a 32-bit command line sort utility for Windows 95/98/NT/2000/XP/Vista/7.  I used older versions years ago, but settled on RPSORT because, where you could use either, RPSORT was very much faster, and much more feature-rich to boot.  You'd probably have to make more than one pass with CMsort using different switches.  However, it does work, and is the only 32-bit console mode sort tool I can think of at present.  GNU would have sort and uniq tools, but my experience of an (also old) Windows port of GNU sort wasn't too happy.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 17, 2011, 11:19 AM
It looks like CMSort is for removing duplicates from within a single file (something I am doing with cygwin and "sort -u") and can't compare two or more files and remove duplicates from one.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: rjbull on April 17, 2011, 04:12 PM
I read the doc to mean CMsort can remove duplicates, but needs more work - quoting CMsort's readme:
III. Example 2
==============

This example shows how to ignore duplicate records. Duplicate records
are recognized by the defined key, not by the whole line. If you want
to exclude duplicate lines, you must perform an additional sort
beforehand by using the whole line as key.

You could combine multiple files before sorting by concatenation, e.g.

copy file_1+file_2+file_3 bigfile

or use a Unix "cat" utility.  But if you're already (erm) "sorted" with GNU sort, fine.  I probably had a DOS port of it, rather than a Windows console port, and I'd imagine the latter should work better, especially on your large files.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on April 17, 2011, 04:57 PM
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't it result in just having a master list?  Since the master has all and all sublists are stripped of entries in the master, would not all subsidiary lists eventually be empty?
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 17, 2011, 06:37 PM
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't it result in just having a master list?  Since the master has all and all sublists are stripped of entries in the master, would not all subsidiary lists eventually be empty?


The problem is that ending up with a 40 or 50GB dictionary is unwieldy, to say the least, because I am constantly updating the dictionary. An example of the purpose of comparing lists: if I have already tested with a 5GB dictionary and I find another dictionary that I also want to use against the target, I can compare the new list to the original 5GB list and remove the duplicate entries (entries that were already tested in the 5GB list) so that I am not retesting those same entries when I test with the new list. I have dozens of dictionaries and there are millions of duplicate entries between them, which is not at all efficient.

Right now, I am removing duplicates from within a 30+GB dictionary (using "sort -u") and it has been running for almost 36 hours already. So, it is not practical to merge every new dictionary with an existing large dictionary and then remove duplicate entries within that file.

Also, you can ignore function #1 (I started using "copy *.txt output.txt" a few days ago and will remove that feature from my post now).
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on April 17, 2011, 07:34 PM
Sounds like you need to incorporate some type of database engine; efficient access to large sets of keys is exactly what one is built for.  How to access it from ad hoc outside apps is another matter.  But giant flat text files seem to have reached their practical limit for your purposes.

I would search around for a forum where people deal exclusively or mainly with db application issues.  Once you have an idea of workable db storage then you may be able to describe a small glue app that can bridge the gap between the db and the apps that want to use it.

Thing is with the way you describe it, it sounds like the key "indexing" or sorting would have to be done repeatedly.  Don't know if even a free db would do that much good.  Seems too scattered to lend itself to streamlining.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: Renegade on April 17, 2011, 07:41 PM
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.

Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 17, 2011, 07:55 PM
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.


I was always under the impression that dictionary = wordlist. Anyway, yes, I use them for auditing passwords of friends and family. Unfortunately, a lot of people (including my brother and parents) don't understand the importance of using a *unique* and *secure* password until I demonstrate for them how quickly their static password of choice ("password", "john1970", and so forth) can be compromised (it took ~3 seconds for my mom's and 5 minutes for my dad's using a 700MB dictionary).

Also, it doesn't look like the entire 30+GB file is duplicates (though there are a lot of junk entries).
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on April 17, 2011, 08:05 PM
It's an interesting problem because most DBs are made to have keys that look up some associated data. In this case, the "keys" are the data.

Instead of sorting a giant file you might look for an approach to sort a smaller "seed" file, then find some routine that adds to sorted flat files via binary search.  The routine would do a binary search, if the word is not found, insert it in sorted position etc..
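That "binary search, insert if missing" routine might look like the following in Python, assuming a seed file small enough to hold in RAM as a sorted list (the function name is illustrative):

```python
import bisect

# Sketch of "binary search, insert if missing" against an in-memory
# sorted list of words (e.g. one letter-file loaded into RAM).

def add_word(sorted_words, word):
    i = bisect.bisect_left(sorted_words, word)
    if i == len(sorted_words) or sorted_words[i] != word:
        sorted_words.insert(i, word)  # O(log n) search, O(n) insert
        return True                   # word was new
    return False                      # duplicate, nothing done
```

The O(n) cost of each insert is exactly the "create space for insertions" problem mentioned below once the file no longer fits in RAM.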

Still, large files on slow disks are a problem.  Growing a small file until it reaches some optimum size may be an approach.  At some point as you say the text file becomes unwieldy.

If anyone has already done it though, it's probably some db guru.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: cmpm on April 17, 2011, 08:21 PM
free
http://www.prestosoft.com/edp_examdiff.asp

not free
http://www.ultraedit.com/products/ultracompare.html

http://www.scootersoftware.com/compare-text-files.php

the only programs in my bookmarks that might help

Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on April 17, 2011, 09:17 PM
Just for chuckles I started a thread on W7 forum.  See if anyone adds a useful suggestion there:

http://www.sevenforums.com/software/157723-interesting-db-dictionary-problem.html#post1353124
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on April 17, 2011, 11:23 PM
Thank you for all the help, guys :). I will continue to look for a solution and follow the w7 thread. Please continue posting if you have any thoughts or new ideas.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on April 30, 2011, 11:37 AM
Well, there is no problem having a DB with the keyword as the index key; the same system is the basis of a search engine.
So I think you can find some DB software (I can suggest something SQLite-based, since SQLite is a very fast and small DB, like SQLite Database Browser (http://sqlitebrowser.sourceforge.net/index.html)) and start inserting the keywords.
If you create the keyword field as unique, you cannot insert duplicates.
Of course, if you create an index on the keyword to speed up searches, that index will take up additional disk space.
I think this is the most efficient system.
Otherwise, you can keep your current system, perhaps splitting the words into several files named after the starting letter of the keyword (A.txt for all the words starting with A, B.txt for words starting with B, etc.).
And if you think it would be useful, I already have a little tool to compare and merge two text files with no duplicates; of course, I haven't tried it on very BIG text files.
However, I strongly suggest the first solution, which I think is more practical.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on April 30, 2011, 05:58 PM
The trouble with using a relational database in this case is the key is the only data.  You're not saving anything or creating any efficiency.  If I have 3 paragraphs of text with key "ProfIrwinCoreySpeach" then I can use the key to get the data.  With the dictionary, there is no other data.  There's nothing to optimize.

You could split the files according to first character as you say.  But the only good way would be to build as you go along.  Having 33 GB of spaghetti data to unravel at the start makes it unwieldy.  You need an engine that builds the files from scratch and feed it words.  Even then when the letter files get too big to hold in ram it's going to be slow because you'll have to either create space for insertions according to binary search, or try to use some fixed size average word length which will waste space.

I think it's the case with no easy solution.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on May 01, 2011, 05:33 AM
The trouble with using a relational database in this case is the key is the only data.  You're not saving anything or creating any efficiency.  If I have 3 paragraphs of text with key "ProfIrwinCoreySpeach" then I can use the key to get the data.  With the dictionary, there is no other data.  There's nothing to optimize.
What do you mean?
According to your example, you have this huge wordlist:
ab
aba
abaca
abacas
abaci
aback
abacus
abacuses
...
etc. etc.

that can be inserted into a database by creating a table with only one field, something like this:

TABLE wordlist (
word VARCHAR
)

That's all.
Or perhaps I have not understood your problem well  :(

You could split the files according to first character as you say.  But the only good way would be to build as you go along.  Having 33 GB of spaghetti data to unravel at the start makes it unwieldy.
Of course, once you have the file well organized, you must insert each new keyword in the correct position. Or you can have some background process sort the file for binary search, or create an index file (to make searches fast) over the unordered data.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 01, 2011, 03:58 PM
Ok since you have the solution please provide it to the OP.
I'd be interested in the performance of a DB that's both keys only and many times larger than physical ram.

edit: in any case it seems a chore to try to massage the data in one go.  Probably a better approach is to grab a chunk and sort it, then add entries as a background task. How to go about it may depend on the tools the OP is currently using.  Perhaps the big word list can be built while still being usable by the front end tools.

I haven't been able to get any hits searching on this type of problem. Maybe it's not one RDBMS companies like to mention as I imagine it's a worst case scenario.  All keys, nothing else to look up.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on May 02, 2011, 04:34 PM
Well, I've made a test.
Maybe SQLite is not so well optimized for file size, but it's very fast at inserting and finding words.
The file size is about two times the TXT, and when using an index to make searches faster, the size is four times!
However, with today's large disks, and on an NTFS volume using file compression, the file size should not be a problem.
Here's (http://www.megaupload.com/?d=B755O5K8) my test: a file (in SQLite) with about 5.6 million words - there may be duplicates, as I made a very quick import.
Searching is very quick, even using the "slow" LIKE.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on May 02, 2011, 05:33 PM
How would I use/apply this? :)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 02, 2011, 05:36 PM
I don't see what you suggested that I didn't already in this post:

https://www.donationcoder.com/forum/index.php?topic=26416.msg245865#msg245865
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 02, 2011, 05:41 PM
How would I use/apply this? :)

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into ram and split on the end of line marker.  For example AutoIt3 has the user defined function _FileReadToArray().  If the file fits in ram it's trivial. Each array element is a line of the file.  Most dictionaries I've used are less than 2 MB in size.

You haven't specified if you're using any particular software most often to access this data.

Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 02, 2011, 06:13 PM
Just for grins I started a thread on a db forum:

http://forums.databasejournal.com/showthread.php?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!

(http://smileys.smilchat.net/emoticon/example/4.gif)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: bhuiraj on May 02, 2011, 06:51 PM
How would I use/apply this? :)

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into ram and split on the end of line marker.  For example AutoIt3 has the user defined function _FileReadToArray().  If the file fits in ram it's trivial. Each array element is a line of the file.  Most dictionaries I've used are less than 2 MB in size.

You haven't specified if you're using any particular software most often to access this data.


I use different pieces of software, so there isn't one specific one that I only use. All of them support this plain dictionary/wordlist/text file format.

Just for grins I started a thread on a db forum:

http://forums.databasejournal.com/showthread.php?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!

(http://smileys.smilchat.net/emoticon/example/4.gif)

lol thanks :)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on May 03, 2011, 03:03 AM
How would I use/apply this? :)
Just use some DB manager for SQLite, like this SQLite Database Browser (http://sqlitebrowser.sourceforge.net/index.html), or the command line version (http://www.sqlite.org/download.html) or there are many other programs (http://www.sqlite.org/cvstrac/wiki?p=ManagementTools).

I don't see what you suggested that I didn't already in this post:
https://www.donationcoder.com/forum/index.php?topic=26416.msg245865#msg245865
Nothing new, just a real implementation, because we didn't know how fast a DB with the keyword as key would be. And I can say that it is very fast and does not need much RAM, but it does need hard disk space. Maybe an enterprise DB (like Oracle/MySQL/etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to work with it, because some GUI apps (like SQLite DB Browser) load the file into RAM and need over 1GB for that 100MB file. The command line version needs only about 3MB instead.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 03, 2011, 07:24 PM
Nothing of new, just a real implementation, because we don't know how fast is a DB with a keyword as a key. And I can say that is very fast and do not need so much ram, but need hard disk space. Maybe enterprise DB (like Oracle/MySQL/etc.) can handle GB of data better than SQLite, but the system is the same.
Of course, you must find the right program to handle, because some GUI App (like SQLite DB Browser) load the file into ram and need over 1GB for that file of 100MB. The command line version, need only about 3MB instead.

I added a reply to the "ask the expert" thread I started, saying there has to be a "worst case scenario," and a keys-only DB is likely to be it.  That was yesterday. I see it hasn't cleared the moderator. I think they don't really want to bring up the Achilles Heel.  I doubt I'll see my reply.

To really test this out you should have some method that directly accesses the flat file.  Compare it for speed vs. overhead.  A dummy run of a few MB doesn't mean anything.  Just about any manipulation all in ram is going to be fast.  We need a comparison of db and non-db access say for an 8 GB flat file of words.  Then see what happens.

I would tend to guess the db overhead would not be worth the effort compared to direct flat file access and manipulation for simple search. Also I suspect if you made a 34 GB table of keys, the db would crash on the OP's machine. :)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on May 04, 2011, 12:48 PM
Of course, this is a quick solution for bhuiraj; it's not optimal, but it does not require special software. I've tested 1.5GB of data with over 227 million words, and the DB is quite big and the searches are not so fast. But, of course, if you need speed you can use a DB like MySQL or Oracle with a fine-tuned configuration (memory index, query cache, partitioned tables, etc.).
In this case, however, it is possible to create an optimal solution (without the generic DB overhead), but you would need to write specific software to handle a very, very big dictionary.
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: MilesAhead on May 04, 2011, 03:29 PM
Where many of the entries are variations on the same base (user01, user02, user1979, user1980, etc.), my last suggestion would be to store only the "base" of the dictionary entry and generate the variations.  That way you'd only have to store "user" on disk and have the algorithm generate all the offshoots.
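A rough sketch of that idea: store only the base word and generate the variants on the fly instead of keeping every one on disk. The suffix patterns below are purely illustrative:

```python
# Sketch: expand a base word into common suffix variants at test time,
# so only the base needs to be stored. Patterns here are examples only.

def variations(base, years=range(1970, 1981), digits=range(100)):
    yield base
    for d in digits:
        yield f"{base}{d:02d}"   # user00 .. user99
    for y in years:
        yield f"{base}{y}"       # user1970 .. user1980
```

The trade-off is CPU time per candidate instead of disk space, which may suit the password-auditing workload well.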

I'm no DB expert. Haven't read any Codd in over 20 years. I think I'm at the limit of what I can contribute. :)

(http://smileys.smilchat.net/emoticon/hygiene/jacuzzi.gif)
Title: Re: IDEA: Compare dictionary (text) files and remove duplicate entries
Post by: DK2IT on May 05, 2011, 07:09 AM
Where many of the entries are variations on the same base, user01 user02 user1979 user1980 etc..  my last suggestion would be only store the "base" of the dictionary entry and generate the variations.
And that can be an interesting idea  :Thmbsup: