
Topic: IDEA: Compare dictionary (text) files and remove duplicate entries  (Read 30170 times)

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Quote from: bhuiraj
How would I use/apply this? :)

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into RAM and split on the end-of-line marker. For example, AutoIt3 has the user-defined function _FileReadToArray(). If the file fits in RAM, it's trivial: each array element is a line of the file. Most dictionaries I've used are less than 2 MB in size.

You haven't specified which particular software you most often use to access this data.
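In Python terms (the thread's example uses AutoIt3's _FileReadToArray(), but the idea is the same), the whole-file-in-RAM dedupe might look like the sketch below; the function and file names are hypothetical:

```python
def dedupe_wordlist(in_path: str, out_path: str) -> int:
    """Read the whole wordlist into RAM, split on the end-of-line marker,
    and write back unique words in their original order.
    Returns the number of duplicates removed."""
    with open(in_path, encoding="utf-8", errors="replace") as f:
        words = f.read().splitlines()      # one list element per line
    seen = set()
    unique = []
    for w in words:
        if w not in seen:
            seen.add(w)
            unique.append(w)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")
    return len(words) - len(unique)
```

As noted, this is trivial only while the file actually fits in RAM.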


MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Just for grins I started a thread on a db forum:

http://forums.databa...?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!



bhuiraj

  • Supporting Member
  • Joined in 2010
  • Posts: 21
Quote from: MilesAhead
How would I use/apply this? :)

I would hazard a guess that most apps that use a "one word per line" flat text file dictionary just suck the whole file into RAM and split on the end-of-line marker. For example, AutoIt3 has the user-defined function _FileReadToArray(). If the file fits in RAM, it's trivial: each array element is a line of the file. Most dictionaries I've used are less than 2 MB in size.

You haven't specified which particular software you most often use to access this data.


I use several different programs rather than one specific one. All of them support this plain dictionary/wordlist/text file format.

Quote from: MilesAhead
Just for grins I started a thread on a db forum:

http://forums.databa...?p=129101#post129101

If anyone uses the terms "password cracking" or "dictionary attack" you're on your own!!



lol thanks :)

DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
Quote from: bhuiraj
How would I use/apply this? :)
Just use a DB manager for SQLite, like SQLite Database Browser, the command-line version, or one of many other programs.

Quote from: MilesAhead
I don't see what you suggested that I didn't already in this post:
https://www.donation....msg245865#msg245865
Nothing new, just a real implementation, because we didn't know how fast a DB with a keyword as the key would be. I can say that it is very fast and does not need much RAM, though it does need hard disk space. Maybe an enterprise DB (like Oracle/MySQL/etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to handle it, because some GUI apps (like SQLite DB Browser) load the file into RAM and need over 1 GB for a 100 MB file. The command-line version needs only about 3 MB instead.
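As a concrete sketch of the keyword-as-key approach described here, using Python's built-in sqlite3 module instead of a GUI browser (table and function names are hypothetical):

```python
import sqlite3

def load_wordlist(db_path: str, words) -> None:
    """Store each word as the primary key of a table.
    INSERT OR IGNORE silently drops duplicate words."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS dict (word TEXT PRIMARY KEY) WITHOUT ROWID"
    )
    con.executemany("INSERT OR IGNORE INTO dict VALUES (?)",
                    ((w,) for w in words))
    con.commit()
    con.close()

def contains(db_path: str, word: str) -> bool:
    """Indexed lookup: the primary key doubles as the search index."""
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT 1 FROM dict WHERE word = ?",
                      (word,)).fetchone()
    con.close()
    return row is not None
```

Because the word itself is the key, loading the file into the table both deduplicates it and builds the search index in one pass.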

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Quote from: DK2IT
Nothing new, just a real implementation, because we didn't know how fast a DB with a keyword as the key would be. I can say that it is very fast and does not need much RAM, though it does need hard disk space. Maybe an enterprise DB (like Oracle/MySQL/etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to handle it, because some GUI apps (like SQLite DB Browser) load the file into RAM and need over 1 GB for a 100 MB file. The command-line version needs only about 3 MB instead.

I added a reply to the "ask the expert" thread I started, saying there has to be a "worst case scenario," and that a keys-only DB is likely to be it. That was yesterday, and I see it hasn't cleared the moderator. I think they don't really want to bring up the Achilles' heel. I doubt I'll see my reply posted.

To really test this out, you should have some method that directly accesses the flat file, then compare it against the DB for speed vs. overhead. A dummy run of a few MB doesn't mean anything; just about any manipulation done entirely in RAM is going to be fast. We need a comparison of DB and non-DB access for, say, an 8 GB flat file of words. Then see what happens.

I would tend to guess the DB overhead would not be worth the effort compared to direct flat-file access and manipulation for a simple search. Also, I suspect that if you made a 34 GB table of keys, the DB would crash on the OP's machine. :)
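The shape of such a comparison could be sketched as below; as noted in the post above, a toy in-RAM run like this proves nothing about an 8 GB file, so this only illustrates how one might time the two approaches side by side (all names here are hypothetical):

```python
import sqlite3
import time

def time_lookups(words, probes):
    """Time the same membership probes against an in-RAM set
    (the flat-file approach) and a SQLite primary-key index."""
    # Flat-file style: everything in RAM, hash-set membership.
    t0 = time.perf_counter()
    table = set(words)
    hits_set = sum(1 for p in probes if p in table)
    t_set = time.perf_counter() - t0

    # DB style: keyword-as-key table, one indexed query per probe.
    t0 = time.perf_counter()
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dict (word TEXT PRIMARY KEY)")
    con.executemany("INSERT OR IGNORE INTO dict VALUES (?)",
                    ((w,) for w in words))
    hits_db = sum(1 for p in probes
                  if con.execute("SELECT 1 FROM dict WHERE word = ?",
                                 (p,)).fetchone())
    t_db = time.perf_counter() - t0
    return hits_set, hits_db, t_set, t_db
```

A real test would load both from disk at full size, since on-disk I/O and paging behavior are exactly what a small run hides.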

DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
Of course, this is a quick solution for bhuiraj; it's not optimal, but it doesn't require special software. I've tested 1.5 GB of data with over 227 million words, and the DB is quite big and the searches are not so fast. But of course, if you need speed, you can use a DB like MySQL or Oracle with a fine-tuned configuration (memory index, query cache, partitioned tables, etc.).
In that case, however, it is possible to create an optimal solution (without the generic DB overhead), but you would need to write specific software to handle a very, very big dictionary.
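For the "specific software for a very big dictionary" case, one standard technique is an external sort-and-merge dedupe, which keeps RAM usage bounded by the chunk size rather than the file size. A hypothetical Python sketch, assuming one word per line with a trailing newline:

```python
import heapq
import os
import tempfile

def dedupe_huge(in_path: str, out_path: str, chunk_lines: int = 1_000_000) -> None:
    """Dedupe a wordlist too big for RAM: sort fixed-size chunks to temp
    files, then merge them, writing each word once.
    Note: the output is sorted, so original ordering is not preserved."""
    tmp_paths = []
    with open(in_path, encoding="utf-8", errors="replace") as f:
        while True:
            # Read at most chunk_lines lines; this bounds memory use.
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            fd, p = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as t:
                t.writelines(chunk)
            tmp_paths.append(p)

    # heapq.merge lazily merges the already-sorted chunk files.
    files = [open(p, encoding="utf-8") for p in tmp_paths]
    prev = None
    with open(out_path, "w", encoding="utf-8") as out:
        for line in heapq.merge(*files):
            if line != prev:          # duplicates are now adjacent
                out.write(line)
                prev = line
    for fobj in files:
        fobj.close()
    for p in tmp_paths:
        os.remove(p)
```

This trades the original ordering and some temporary disk space for a memory footprint that stays flat no matter how large the input grows.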

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Where many of the entries are variations on the same base (user01, user02, user1979, user1980, etc.), my last suggestion would be to store only the "base" of the dictionary entry and generate the variations. That way you'd only have to store "user" on disk and have the algorithm generate all the offshoots.

I'm no DB expert; I haven't read any Codd in over 20 years. I think I'm at the limit of what I can contribute. :)
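The store-the-base-and-generate idea could be sketched as follows; the suffix pattern here (two-digit counters plus years) is just a hypothetical example matching the user01/user1979 entries mentioned above:

```python
def expand_base(base, suffixes):
    """Generate dictionary entries from one stored base word plus a
    suffix pattern, instead of storing every variation on disk."""
    return [base + s for s in suffixes]

# Hypothetical suffix pattern: 00-99 counters and 1900-2029 years.
numeric_suffixes = ([f"{n:02d}" for n in range(100)] +
                    [str(y) for y in range(1900, 2030)])
```

On disk you keep only "user" and the pattern definition; the 230 variations are produced on the fly when the list is consumed.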



DK2IT

  • Supporting Member
  • Joined in 2006
  • Posts: 14
Quote from: MilesAhead
Where many of the entries are variations on the same base (user01, user02, user1979, user1980, etc.), my last suggestion would be to store only the "base" of the dictionary entry and generate the variations.
And that could be an interesting idea  :Thmbsup: