Topic: IDEA: Compare dictionary (text) files and remove duplicate entries
MilesAhead
« Reply #25 on: May 02, 2011, 05:41:35 PM »

Quote
How would I use/apply this? smiley

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into RAM and split on the end-of-line marker.  For example, AutoIt3 has the user-defined function _FileReadToArray().  If the file fits in RAM it's trivial: each array element is a line of the file.  Most dictionaries I've used are less than 2 MB in size.
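
Just to make that concrete, here's a minimal sketch of the whole-file approach in Python rather than AutoIt3 (the file names are placeholders):

Code: (Python)
# Read the whole wordlist into RAM and split on the end-of-line marker,
# the same idea as AutoIt3's _FileReadToArray().
with open("wordlist.txt", "r", encoding="utf-8") as f:
    words = f.read().splitlines()

# dict.fromkeys() drops duplicate lines while keeping first-seen order.
unique = list(dict.fromkeys(words))

with open("wordlist_dedup.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(unique) + "\n")

This only works while the whole file fits comfortably in RAM, which is exactly the point being made above.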

You haven't specified whether there's any particular software you use most often to access this data.

MilesAhead
« Reply #26 on: May 02, 2011, 06:13:37 PM »

Just for grins I started a thread on a db forum:

http://forums.databasejou...d.php?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!


bhuiraj
« Reply #27 on: May 02, 2011, 06:51:02 PM »

Quote from: MilesAhead
How would I use/apply this? smiley

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into RAM and split on the end-of-line marker.  For example, AutoIt3 has the user-defined function _FileReadToArray().  If the file fits in RAM it's trivial: each array element is a line of the file.  Most dictionaries I've used are less than 2 MB in size.

You haven't specified whether there's any particular software you use most often to access this data.


I use several different pieces of software, so there isn't one specific program I rely on. All of them support this plain dictionary/wordlist/text file format.

Quote from: MilesAhead
Just for grins I started a thread on a db forum:

http://forums.databasejou...d.php?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!



lol thanks smiley
DK2IT
« Reply #28 on: May 03, 2011, 03:03:20 AM »

Quote
How would I use/apply this? smiley

Just use a DB manager for SQLite, like SQLite Database Browser, or the command-line version; there are many other programs as well.

Quote
I don't see what you suggested that I didn't already in this post:
http://www.donationcoder....26416.msg245865#msg245865
Nothing new, just a real implementation, because we didn't know how fast a DB keyed on the word would be. I can say that it is very fast and does not need much RAM, but it does need hard disk space. Maybe an enterprise DB (like Oracle/MySQL/etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to handle it, because some GUI apps (like SQLite DB Browser) load the whole file into RAM and need over 1 GB for that 100 MB file. The command-line version needs only about 3 MB instead.
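
For anyone who wants to try the same thing from a script instead of a GUI, here's a rough sketch using Python's built-in sqlite3 module (file names are placeholders; the word-as-primary-key schema is the idea being described):

Code: (Python)
import sqlite3

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

# INSERT OR IGNORE skips duplicate keys, so loading the wordlist
# and de-duplicating it happen in the same pass.
with open("wordlist.txt", "r", encoding="utf-8") as f:
    conn.executemany("INSERT OR IGNORE INTO words VALUES (?)",
                     ((line.rstrip("\n"),) for line in f))
conn.commit()

# Lookups go through the primary-key index, so they stay fast
# even when the table is much bigger than RAM.
hit = conn.execute("SELECT 1 FROM words WHERE word = ?",
                   ("example",)).fetchone()
print("found" if hit else "not found")
conn.close()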
MilesAhead
« Reply #29 on: May 03, 2011, 07:24:16 PM »
« Reply #29 on: May 03, 2011, 07:24:16 PM »

Quote from: DK2IT
Nothing new, just a real implementation, because we didn't know how fast a DB keyed on the word would be. I can say that it is very fast and does not need much RAM, but it does need hard disk space. Maybe an enterprise DB (like Oracle/MySQL/etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to handle it, because some GUI apps (like SQLite DB Browser) load the whole file into RAM and need over 1 GB for that 100 MB file. The command-line version needs only about 3 MB instead.

I added a reply to the "ask the expert" thread I started, saying there has to be a worst-case scenario, and a keys-only DB is likely to be it.  That was yesterday, and I see it hasn't cleared the moderator.  I think they don't really want to bring up the Achilles' heel.  I doubt I'll see my reply.

To really test this out you should have some method that directly accesses the flat file, and compare it against the DB for speed and overhead.  A dummy run of a few MB doesn't mean anything; just about any manipulation done entirely in RAM is going to be fast.  We need a comparison of DB and non-DB access, say for an 8 GB flat file of words, and then see what happens.

I would tend to guess the DB overhead would not be worth the effort compared to direct flat-file access and manipulation for a simple search. Also, I suspect that if you made a 34 GB table of keys, the DB would crash on the OP's machine. smiley
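
One way to get the non-DB side of that comparison, assuming one word per line, is a plain streaming scan, so memory use stays constant no matter how big the file gets (file name and helper name are made up for illustration):

Code: (Python)
# Stream the flat file line by line instead of loading it into RAM.
def in_wordlist(path, target):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.rstrip("\n") == target:
                return True
    return False

print(in_wordlist("wordlist.txt", "example"))

Worst case (word not present) reads the whole file once, so timing this against an indexed DB lookup on the same 8 GB file would settle the question.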
DK2IT
« Reply #30 on: May 04, 2011, 12:48:49 PM »

Of course, this is a quick solution for bhuiraj; it's not optimal, but it does not require special software. I've tested 1.5 GB of data with over 227 million words, and the DB is quite big and the searches are not so fast. But, of course, if you need speed you can use a DB like MySQL or Oracle with a fine-tuned configuration (in-memory indexes, query cache, partitioned tables, etc.).
In that case, however, it is possible to create an optimal solution (without the generic DB overhead), but you would need to write specific software to handle a very, very big dictionary.
MilesAhead
« Reply #31 on: May 04, 2011, 03:29:35 PM »

Where many of the entries are variations on the same base (user01, user02, user1979, user1980, etc.), my last suggestion would be to store only the "base" of the dictionary entry and generate the variations.  That way you'd only have to store "user" on disk and have the algorithm generate all the offshoots.
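
A quick sketch of what such a generator could look like (the patterns and helper name are made up for illustration):

Code: (Python)
# Store only the base ("user") and generate the numbered
# offshoots on the fly instead of keeping them all on disk.
def variations(base, start, stop, width=2):
    yield base
    for n in range(start, stop + 1):
        plain = f"{base}{n}"
        padded = f"{base}{n:0{width}d}"  # zero-padded form, e.g. user01
        yield plain
        if padded != plain:  # skip duplicates once n outgrows the padding
            yield padded

print(list(variations("user", 1, 2)))
# ['user', 'user1', 'user01', 'user2', 'user02']
print(list(variations("user", 1979, 1980)))
# ['user', 'user1979', 'user1980']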

I'm no DB expert. Haven't read any Codd in over 20 years. I think I'm at the limit of what I can contribute. smiley


DK2IT
« Reply #32 on: May 05, 2011, 07:09:05 AM »

Quote from: MilesAhead
Where many of the entries are variations on the same base (user01, user02, user1979, user1980, etc.), my last suggestion would be to store only the "base" of the dictionary entry and generate the variations.
And that could be an interesting idea.  Thmbsup