

IDEA: Compare dictionary (text) files and remove duplicate entries


bhuiraj:
I'm not quite getting what you want for the end result. If you continually filter the additional lists to remove entries found in the master list, wouldn't you end up with just the master list?  Since the master has everything and all sublists are stripped of entries already in the master, wouldn't every subsidiary list eventually be empty?

-MilesAhead (April 17, 2011, 04:57 PM)
--- End quote ---

Ending up with a 40 or 50GB dictionary is unwieldy, to say the least, especially since I am constantly updating the dictionary. As an example of why I want to compare lists: if I have already tested with a 5GB dictionary and then find another dictionary I also want to run against the target, I can compare the new list against the original 5GB list and remove the duplicate entries (ones already tested in the 5GB run), so that I am not retesting those same entries when I test with the new list. I have dozens of dictionaries with millions of duplicate entries between them, which is not at all efficient.

Right now, I am removing duplicates from within a 30+GB dictionary (using "sort -u") and it has been running for almost 36 hours already. So, it is not practical to merge every new dictionary with an existing large dictionary and then remove duplicate entries within that file.
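For that "strip already-tested entries" step, something along these lines should work on any system with the standard sort and comm tools. This is only a sketch; the filenames are placeholders, and master.txt is assumed to be the already-sorted output of that sort -u run:

--- Code: ---
# master.txt is assumed to be sorted already (e.g. the sort -u output).
# Sort only the (much smaller) new list, then keep the lines unique to it.
LC_ALL=C sort -u new.txt > new_sorted.txt
LC_ALL=C comm -13 master.txt new_sorted.txt > new_only.txt
--- End code ---

Both files must be sorted under the same collation for comm to behave, hence the LC_ALL=C on both commands. As an aside, GNU sort's -T (put temp files on a faster disk) and -S (bigger in-memory buffer) switches, plus --parallel on newer coreutils, can shorten runs like that 36-hour one considerably.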

Also, you can ignore function #1 (I started using "copy *.txt output.txt" a few days ago and will remove that feature from my post now).

MilesAhead:
Sounds like you need to incorporate some type of database engine.  Efficient access to large sets of keys is exactly what they are built for.  How to access one from ad hoc outside apps is another matter.  But giant flat text files seem to have reached their practical limit for your purposes.

I would search around for a forum where people deal exclusively or mainly with db application issues.  Once you have an idea of a workable db storage scheme, you may be able to describe a small glue app that bridges the gap between the db and the apps that want to use it.

Thing is, with the way you describe it, it sounds like the key "indexing" or sorting would have to be done repeatedly.  Don't know if even a free db would do that much good.  Seems too scattered to lend itself to streamlining.
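For what it's worth, here is a rough sketch of what the db route could look like using SQLite's command-line shell. The database filename and table names are made up for illustration, and a word list containing embedded tab characters would need a different separator:

--- Code: ---
sqlite3 words.db <<'SQL'
CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY);
CREATE TEMP TABLE incoming (word TEXT);
.separator "\t"
.import new.txt incoming
INSERT OR IGNORE INTO words SELECT word FROM incoming;
SQL
--- End code ---

The PRIMARY KEY gives each lookup an index search instead of a full scan or resort, and INSERT OR IGNORE silently drops anything already stored, so duplicate removal happens as a side effect of loading each new list.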

Renegade:
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.

bhuiraj:
I can't imagine how you could possibly get a "dictionary" that's 40 or 50GB. That's more than enough for every word in every language. Including Klingon. :)

It must be almost entirely duplicates. Or it's not a dictionary in the traditional sense. Is it an actual dictionary, with definitions, or a word list? It sounds like a word list for dictionary attacks.


-Renegade (April 17, 2011, 07:41 PM)
--- End quote ---
I was always under the impression that dictionary = wordlist. Anyway, yes, I use them for auditing the passwords of friends and family. Unfortunately, a lot of people (including my brother and parents) don't understand the importance of using a *unique* and *secure* password until I demonstrate for them how quickly their static password of choice ("password", "john1970", and so forth) can be compromised (it took ~3 seconds to crack my mom's and about 5 minutes for my dad's using a 700MB dictionary).

Also, it doesn't look like the entire 30+GB file is duplicates (though there are a lot of junk entries).

MilesAhead:
It's an interesting problem, because most dbs are made to use keys to look up some associated data.  In this case, the "keys" are the data.

Instead of sorting a giant file, you might look for an approach that sorts a smaller "seed" file, then find some routine that adds to sorted flat files via binary search.  The routine would do a binary search and, if the word is not found, insert it in sorted position, etc.

Still, large files on slow disks are a problem.  Growing a small file until it reaches some optimum size may be an approach.  At some point, as you say, the text file becomes unwieldy.
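Random-access inserts into a flat file are hard to do efficiently from a script, but the same "keep it sorted, never fully resort" idea is available through GNU sort's merge mode, which makes a single linear pass over inputs that are already sorted. A sketch, again with placeholder filenames:

--- Code: ---
# Sort only the small new batch, then merge it into the sorted master.
LC_ALL=C sort -u new.txt > batch_sorted.txt
# -m merges pre-sorted inputs in one linear pass; -u drops duplicates.
LC_ALL=C sort -m -u master.txt batch_sorted.txt > merged.txt
mv merged.txt master.txt
--- End code ---

That way the full N log N sorting cost is never paid on the big file again; only each new batch gets a real sort.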

If anyone has already done it though, it's probably some db guru.
