DonationCoder.com Software > Post New Requests Here

IDEA: Compare dictionary (text) files and remove duplicate entries

bhuiraj:
IT security is a hobby of mine and one aspect requires building large (from hundreds of megabytes to several gigabytes) dictionary (text) files in the format of one word on each line. For example:

donation
coder
<3
cat
dog
Mary

Currently, my collection is spread out over dozens of text files (the largest is over 30GB, while most are between a few MB and 1.5GB). The problem is that I waste days and even weeks of processing time because of all the duplicate entries between files. So, I need a program that will let me select a master list and multiple additional lists. Both of the functions below would be ideal, but it's entirely up to whatever you are comfortable doing and consider a coding snack; if you can code an app that does both and works with large text files, I can make a small donation to you as a thank you. The program should do either of the following:
1) compare the master list to the additional lists and remove the duplicate entries from the additional lists
2) compare the additional lists to the master list and remove the duplicate entries from the master list (the opposite of #1)
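
The two functions above can be sketched in Python as one routine with the roles of the files swapped. This is only a minimal illustration, not a finished tool: the function names `load_words` and `remove_known_words` are hypothetical, and the sketch assumes the reference list(s) fit in memory as a set, which will not hold for the largest files mentioned here. Files are read as bytes so that no text encoding has to be guessed, and duplicates inside the target file itself are left alone.

```python
import os
import tempfile


def load_words(path):
    """Return the set of non-empty lines in path (one word per line), as bytes."""
    words = set()
    with open(path, "rb") as f:
        for line in f:
            word = line.rstrip(b"\r\n")
            if word:
                words.add(word)
    return words


def remove_known_words(reference_paths, target_path):
    """Rewrite target_path without any word that occurs in reference_paths.

    Assumption: all reference words fit in memory at once.
    Returns the number of lines removed from the target.
    """
    known = set()
    for path in reference_paths:
        known |= load_words(path)

    removed = 0
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Stream the target line by line into a temporary file in the same folder,
    # so the target itself is never loaded into memory in full.
    with open(target_path, "rb") as src, tempfile.NamedTemporaryFile(
        "wb", dir=target_dir, delete=False
    ) as dst:
        for line in src:
            word = line.rstrip(b"\r\n")
            if word in known:
                removed += 1
            else:
                dst.write(word + b"\n")
    # Replace the original only after the filtered copy is fully written.
    os.replace(dst.name, target_path)
    return removed
```

Function 1 would then be `remove_known_words(["master.txt"], "extra1.txt")` repeated for each additional list, and function 2 would be `remove_known_words(["extra1.txt", "extra2.txt"], "master.txt")`; the file names are placeholders.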

I have tried a couple of public programs, but they crash/error out when I try to use multiple files or files that are over a couple hundred megabytes. I came across posts suggesting that it might be because of RAM limitations, but I don't think that this is the case, since I have over 3GB of RAM.

I hope that someone will be able to do this and apologize if I have not given enough detail. Please let me know if you have any questions about this. Thank you!

skwire:
Can you rar/7zip some sample files and make them available?  If you don't have any webspace, I can PM you details to FTP them directly to my server.

bhuiraj:
Sure :) I keep 7zipped copies of most of them since they can be so large when uncompressed. I can upload a couple files, but wouldn't be able to upload the really big lists because of my connection speed. It would be faster for me to give you the links to some sample dictionaries:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
ftp://ftp.openwall.com/pub/wordlists/all.gz
http://dl.free.fr/hmUZYo0GE/theargonlistver2.zip (click on the small text link "Télécharger ce fichier" to download)

But, let me know if you need me to upload some other lists I have or if you have any questions. Thanks :)

bhuiraj:
There are also more dictionaries here:
http://wordlist.sourceforge.net/
ftp://nic.funet.fi/pub/unix/security/dictionaries/DEC-collection/
http://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/
http://www.wuala.com/Lifehacker%20Fun%20File%20Swap/Documents/Huge%20Word%20List.txt?lang=en
http://www.megaupload.com/?d=3YQT6SRB (a big, very compressible dictionary)
http://mirror.sweon.net/madchat/crypto/wordlists/
http://www.mirrorservice.org/sites/ftp.wiretapped.net/pub/security/info/reference/wordlists/

rjbull:
What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here.  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicates in line, which would be ideal for you.
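
The idea of combining files and removing duplicates in one pass can be sketched in Python as follows. This is not RPSORT and makes no claim about how RPSORT works; it is a stand-in that assumes every input file has already been sorted in plain byte order, and the name `merge_sorted_unique` is hypothetical. Because the inputs are merged as streams, only one line per file is held in memory at a time, so file size is not the limiting factor.

```python
import heapq


def _words(f):
    """Yield non-empty lines of an open binary file without line endings."""
    for line in f:
        word = line.rstrip(b"\r\n")
        if word:
            yield word


def merge_sorted_unique(input_paths, output_path):
    """Merge byte-sorted word lists into one sorted file without duplicates.

    Assumption: each input file is already sorted in byte order.
    Returns the number of words written.
    """
    files = [open(path, "rb") for path in input_paths]
    try:
        previous = None
        written = 0
        with open(output_path, "wb") as out:
            # heapq.merge lazily interleaves the sorted streams, so equal
            # words arrive next to each other and can be skipped.
            for word in heapq.merge(*(_words(f) for f in files)):
                if word != previous:
                    out.write(word + b"\n")
                    previous = word
                    written += 1
    finally:
        for f in files:
            f.close()
    return written
```

Each input stays open for the whole merge, so the number of lists that can be combined at once is bounded by the operating system's open-file limit; dozens of files should normally be fine.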
