
Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.

Messages - bhuiraj

How would I use/apply this? :)

I would hazard a guess that most apps that use a "one word per line" flat-text dictionary just suck the whole file into RAM and split on the end-of-line marker. For example, AutoIt3 has the user-defined function _FileReadToArray(). If the file fits in RAM it's trivial: each array element is a line of the file. Most dictionaries I've used are less than 2 MB in size.
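For the sake of illustration, here is a rough bash equivalent of that "read the whole file into an array" approach (the filename `dictionary.txt` is just a placeholder):

```shell
# Read an entire wordlist into memory, one array element per line --
# roughly what AutoIt3's _FileReadToArray() does.
mapfile -t words < dictionary.txt
echo "${#words[@]} words loaded"
echo "first word: ${words[0]}"
```

This only works while the file fits comfortably in RAM, which is exactly the limitation being discussed in this thread.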

You haven't specified if you're using any particular software most often to access this data.

I use different pieces of software, so there isn't one specific one that I only use. All of them support this plain dictionary/wordlist/text file format.

Just for grins I started a thread on a db forum:


If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!

lol thanks :)

How would I use/apply this? :)

Out of curiosity, could you see the methodology? I would think something that big would have to use some type of merge sort. Especially if you only have one disk, that would be thrash city.

In cygwin's tmp folder, I noticed that the sort process created several smaller temporary files that were merged together in stages. My thought was that it splits the large dictionary file up into many smaller files of approximately the same size (some dictionaries were split up into multiple 64MB files), compares them, merges some of them together, compares again, and so forth until it ends up with two files that are half the size of the complete unduped file. Then, those two files are merged together to create the final dictionary with the duplicates removed.
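The split-then-merge behaviour described above can be sketched by hand with standard tools (this is a simplified illustration of the idea, not what GNU sort literally does internally; filenames are hypothetical):

```shell
# External merge-sort dedupe, done manually:
# 1) split the big file into chunks, 2) sort/dedupe each chunk in memory,
# 3) merge the pre-sorted chunks while dropping duplicates (-m merges
#    already-sorted inputs without re-sorting).
split -l 1000000 big_dictionary.txt chunk_
for f in chunk_*; do
    sort -u "$f" -o "$f"          # each chunk is now sorted and deduped
done
sort -m -u chunk_* > deduped.txt  # cheap merge pass over sorted chunks
rm chunk_*
```

GNU sort does essentially this on its own with temporary files, which matches the 64MB fragments observed in Cygwin's tmp folder.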

In case anyone was wondering, it would take well in excess of a week to sort a 33GB dictionary. I started sorting my 33GB file on April 15th and finally cancelled it today (after 9 days), not even half done.

Thank you for all the help, guys :). I will continue to look for a solution and follow the w7 thread. Please continue posting if you have any thoughts or new ideas.

I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.

I was always under the impression that dictionary = wordlist. Anyway, yes, I use them for auditing passwords of friends and family. Unfortunately, a lot of people (including my brother and parents) don't understand the importance of using a *unique* and *secure* password until I demonstrate for them how quickly their static password of choice ("password", "john1970", and so forth) can be compromised (it took ~3 seconds for my mom's and 5 minutes for my dad's using a 700MB dictionary).

Also, it doesn't look like the entire 30+GB file is duplicates (though there are a lot of junk entries).

I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't it result in just having a master list?  Since the master has all and all sublists are stripped of entries in the master, would not all subsidiary lists eventually be empty?

The problem is that ending up with a 40 or 50GB dictionary is unwieldy, to say the least. This is a problem because I am constantly updating the dictionary. An example of the purpose of being able to compare lists: if I have already tested with a 5GB dictionary and I find another dictionary that I also want to use against the target, I can compare the new list to the original 5GB list and remove duplicate entries (dictionary entries that were already tested in the 5GB list) so that I am not retesting those same entries when I test with the new list. I have dozens of dictionaries and there are millions of duplicate entries between them, which is not at all efficient.

Right now, I am removing duplicates from within a 30+GB dictionary (using "sort -u") and it has been running for almost 36 hours already. So, it is not practical to merge every new dictionary with an existing large dictionary and then remove duplicate entries within that file.
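One thing that may shave time off a run like that: GNU sort accepts tuning flags for its buffer size and temp directory, and newer builds can sort in parallel. A hedged example (the paths are placeholders, and `--parallel` only exists in coreutils 8.6+, so an older Cygwin build may not have it):

```shell
# Bigger in-memory buffer (-S) means fewer temp-file merge passes;
# -T points the temp files at a fast disk, ideally a different
# physical drive than the input to avoid thrashing.
sort -u -S 1G -T /cygdrive/d/tmp --parallel=4 input.txt > output.txt
```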

Also, you can ignore function #1 (I started using "copy *.txt output.txt" a few days ago and will remove that feature from my post now).

It looks like CMSort is for removing duplicates from within a single file (something I am doing with cygwin and "sort -u") and can't compare two or more files and remove duplicates from one.

sort -u input.txt >output.txt (in Cygwin) is very fast and working great. Thank you for pointing me towards the ported nix apps.

@Moderator: Please close this thread.

Assuming you are running Windows, I'd look for a port of Linux utilities.  Linux has many tools designed for the output of one to be piped to the input of another, rather than loading everything into ram.

Searching around, the solutions I found ranged from creating a database to a do-it-yourself merge sort, etc., but the simplest was a Linux command-line solution.

Check for Windows ports of the sort and uniq commands.
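The classic pipeline those two commands are designed for looks like this (note that uniq on its own is not enough, because it only collapses *adjacent* repeats, so the input must be sorted first):

```shell
# sort brings duplicate lines next to each other; uniq then drops repeats
sort words.txt | uniq > deduped.txt

# equivalent in a single process:
sort -u words.txt > deduped.txt
```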

edit: the discussion in this thread may give you some ideas:

I ended up installing Cygwin and I thought that "uniq -u input.txt output.txt" seemed to work, but since I can't open the large file and check myself, I decided to make a sample file and test it. Unfortunately, after running "uniq -u input.txt output.txt", it looks like it didn't remove the duplicate entries and instead just copied input.txt to output.txt:




"sort -u input.txt output.txt" seems to work, but it also organizes the lines in alphabetical order, which I don't need. Is there any way to have it only show unique lines, but not alphabetize everything?

Post New Requests Here / Re: PlayDead
« on: April 15, 2011, 11:10 AM »
I may have a more efficient solution, if you have the lungs and the attitude to do it...

Imagine that you are in an extremely noisy environment but your very life depends on the computer, and then...
"tell" her to keep off that machine!

 (see attachment in previous post)

Why do you assume it's a "her"? :P

Post New Requests Here / Re: PlayDead
« on: April 15, 2011, 11:08 AM »
I use a hand-drawn skull and crossbones and the text "Do not touch or else" on a piece of printer paper taped to the monitor.
The screen is then locked (Win+L) so the process continues in the background, but it doesn't display and only a valid login can 'unlock' it.
I have found this to actually be a very effective solution.

I need a small app that can remove duplicate entries from within a single large (anywhere from a couple hundred MB to several GB in size) text file. I have tried using several apps to do this, but they all crash, possibly because they run out of free RAM to use (loading an 800MB text file results in over 1.5GB of RAM being used by the apps). Text files are in the format of one word per line:


If there is some workaround (including splitting files up into multiple parts and then comparing each one to the rest once to make sure there are no duplicates), it would be great.

Please see https://www.donation...ex.php?topic=26416.0 for sample dictionary files and let me know if you have any questions. Thank you!

What you really need is a version of RPSORT re-compiled for 32-bit Windows consoles.  It's currently a 16-bit DOS program; details here.  Even so, it might be worth trying, as it's very fast and powerful.  You can combine files and remove duplicates in line, which would be ideal for you.
Looks like a great app, but I tried running it in Windows 95 compatibility mode and it threw the error "ERROR 043: Line exceeds max length of 32750 bytes."

Sure :) I keep 7zipped copies of most of them since they can be so large when uncompressed. I can upload a couple files, but wouldn't be able to upload the really big lists because of my connection speed. It would be faster for me to give you the links to some sample dictionaries: (click on the small text link "Télécharger ce fichier" ["Download this file"] to download)

But, let me know if you need me to upload some other lists I have or if you have any questions. Thanks :)

IT security is a hobby of mine and one aspect requires building large (from hundreds of megabytes to several gigabytes) dictionary (text) files in the format of one word on each line. For example:


Currently, my collection includes several gigabytes of dictionaries spread out over dozens of text files (the largest being over 30GB, while most are between a few MB and 1.5GB). The problem is that I waste days and even weeks of processing time because of all the duplicate entries between files. So, I need a program that will let me select a master list and multiple additional lists and do any of the following (all functions would be ideal, but it's entirely up to whatever you are comfortable doing and consider a coding snack; if you can code an app to do all of these functions and they work with large text files, I can make a small donation to you as a thank you):
1) compare the master list to the additional lists and remove the duplicate entries from the additional lists
2) compare the additional lists to the master list and remove the duplicate entries from the master list (the opposite of #1)
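Both of these functions map fairly directly onto the standard comm utility, assuming both files are sorted first (filenames below are placeholders): comm compares two sorted files and its -1/-2/-3 flags suppress lines unique to file 1, unique to file 2, and common to both, respectively.

```shell
# comm requires sorted input on both sides.
sort master.txt -o master.sorted
sort extra.txt  -o extra.sorted

# Function 1: strip from the additional list anything already in the master
# (-13 = hide lines unique to master and lines common to both)
comm -13 master.sorted extra.sorted > extra.cleaned

# Function 2: strip from the master anything present in the additional list
# (-23 = hide lines unique to extra and lines common to both)
comm -23 master.sorted extra.sorted > master.cleaned
```

Because comm streams through both files line by line, it never needs to hold a whole list in RAM, which matters at these file sizes.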

I have tried a couple public programs, but they crash/error out when I try to use multiple files or files that are over a couple hundred megabytes. I came across posts suggesting that it might be because of RAM limitations, but I don't think that this is the case, since I have over 3GB of RAM.

I hope that someone will be able to do this and apologize if I have not given enough detail. Please let me know if you have any questions about this. Thank you!

DC Gamer Club / Re: Humble Indie Bundle (pay what you want sale)
« on: April 13, 2011, 02:26 AM »
I buy these bundles on principle alone.  Good stuff.   :Thmbsup:
Agreed. I still haven't installed any of the apps from the last bundle and probably won't even install the ones from this bundle, but I like encouraging these kinds of offers. Actually, I bought two bundles this time around, so if anyone truly wants the games and doesn't have any way to purchase it from the website directly, I can give them my second key as an early XMas present.

Have any of the other people that won Genie Timeline Professional received an email with the app/info from the company? After I didn't get an email from them for a couple days, I emailed the contact person listed in the DC email almost a week ago and never received a response. I emailed Carl last night, so hopefully he'll get back to me. :)

Thank you for organizing and running this awesome giveaway, mouser and thank you to the donators for your donations, as well. I still can't believe I won something! (Especially something very useful - it was my #1 choice :) )
