Show Posts - DK2IT

Post New Requests Here / Re: IDEA: Compare dictionary (text) files and remove duplicate entries

« on: May 03, 2011, 03:03 AM »

How would I use/apply this? :)
-bhuiraj (May 02, 2011, 05:33 PM)

Just use a DB manager for SQLite, such as SQLite Database Browser or the command-line version; there are many other programs too.

I don't see what you suggested that I didn't already in this post:
https://www.donationcoder.com/forum/index.php?topic=26416.msg245865#msg245865
-MilesAhead (May 02, 2011, 05:36 PM)

Nothing new, just a real implementation, because we didn't know how fast a DB with the keyword as the key would be. I can say that it is very fast and does not need much RAM, but it does need hard disk space. Maybe an enterprise DB (like Oracle, MySQL, etc.) can handle GBs of data better than SQLite, but the approach is the same.
Of course, you must find the right program to handle the file, because some GUI apps (like SQLite DB Browser) load the file into RAM and need over 1GB for that 100MB file. The command-line version needs only about 3MB instead.

Post New Requests Here / Re: IDEA: Compare dictionary (text) files and remove duplicate entries

« on: May 02, 2011, 04:34 PM »

Well, I've run a test.
Maybe SQLite is not so well optimized for file size, but it's very fast at inserting and finding words.
The file size is about twice that of the TXT. And with an index to make searches faster, the size is four times as large!
However, with large storage and file compression on an NTFS volume, the file size should not be a problem.
Here's my test: a file (in SQLite) with about 5.6 million words. There may be duplicates, since I did a very quick import.
Searching is very quick, even with the "slow" LIKE.
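To make the two kinds of search concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tiny in-memory table, the index name idx_wordlist_word and the helper names are assumptions for illustration only; this is not the test file mentioned above.

```python
import sqlite3

# Assumption: a tiny in-memory table stands in for the real multi-million-word file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordlist (word VARCHAR)")
conn.executemany(
    "INSERT INTO wordlist (word) VALUES (?)",
    [(w,) for w in ["ab", "aba", "abaca", "abacas", "abaci", "aback", "abacus", "abacuses"]],
)
# The index speeds up lookups, at the cost of extra disk space.
conn.execute("CREATE INDEX idx_wordlist_word ON wordlist (word)")


def has_word(conn, word):
    """Exact-match lookup; this is the query the index helps most."""
    row = conn.execute(
        "SELECT 1 FROM wordlist WHERE word = ? LIMIT 1", (word,)
    ).fetchone()
    return row is not None


def words_starting_with(conn, prefix):
    """Prefix search with LIKE; note that % and _ inside prefix act as wildcards."""
    rows = conn.execute(
        "SELECT word FROM wordlist WHERE word LIKE ? ORDER BY word", (prefix + "%",)
    )
    return [r[0] for r in rows]
```

For example, has_word(conn, "abacus") checks a single word, while words_starting_with(conn, "abacu") lists every word with that prefix.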

Post New Requests Here / Re: IDEA: Compare dictionary (text) files and remove duplicate entries

« on: May 01, 2011, 05:33 AM »

The trouble with using a relational database in this case is the key is the only data. You're not saving anything or creating any efficiency. If I have 3 paragraphs of text with key "ProfIrwinCoreySpeach" then I can use the key to get the data. With the dictionary, there is no other data. There's nothing to optimize.

What do you mean?
According to your example, you have this huge wordlist:
ab
aba
abaca
abacas
abaci
aback
abacus
abacuses
...
etc. etc.

It can be inserted into a database by creating a table with only one field, something like this:

CREATE TABLE wordlist (
word VARCHAR
);

That's all.
Or maybe I have not understood your problem well :(
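As a rough sketch of the one-field table idea, this is how the sample words could be loaded with Python's built-in sqlite3 module. The function name load_words and the in-memory database are assumptions for illustration; a real wordlist would go into a file on disk.

```python
import sqlite3


def load_words(conn, words):
    """Create the one-column table (if missing) and insert each word as a row."""
    conn.execute("CREATE TABLE IF NOT EXISTS wordlist (word VARCHAR)")
    conn.executemany(
        "INSERT INTO wordlist (word) VALUES (?)", ((w,) for w in words)
    )
    conn.commit()


# Assumption: ":memory:" is used only to keep the example self-contained.
conn = sqlite3.connect(":memory:")
load_words(conn, ["ab", "aba", "abaca", "abacas", "abaci", "aback", "abacus", "abacuses"])
count = conn.execute("SELECT COUNT(*) FROM wordlist").fetchone()[0]
```

With a real text file, the list could be replaced by a generator that yields one stripped line at a time, so the whole file never has to sit in RAM.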

You could split the files according to first character as you say. But the only good way would be to build as you go along. Having 33 GB of spaghetti data to unravel at the start makes it unwieldy.

Of course, once you have the file well organized, the next time you add a new keyword you must insert it in the correct position. Or you can have a background process sort the file for binary search. Or you can create an index file over the unordered data to make searches fast.
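The "insert in the correct position" idea can be sketched with Python's bisect module. This assumes the sorted words fit in a list in memory, which would not hold for a file of many GB; the function name insert_sorted_unique is made up for the example.

```python
import bisect


def insert_sorted_unique(words, new_word):
    """Insert new_word into the sorted list words unless it is already there.

    Binary search finds the position, so the list stays sorted and
    duplicate-free. Returns True if the word was added.
    """
    i = bisect.bisect_left(words, new_word)
    if i < len(words) and words[i] == new_word:
        return False  # already present, nothing to do
    words.insert(i, new_word)
    return True


words = ["ab", "aba", "abaci"]
added_new = insert_sorted_unique(words, "abaca")  # goes between "aba" and "abaci"
added_dup = insert_sorted_unique(words, "aba")    # duplicate, list unchanged
```

The same binary search also answers "is this word already in the list?" without scanning everything.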

Post New Requests Here / Re: IDEA: Compare dictionary (text) files and remove duplicate entries

« on: April 30, 2011, 11:37 AM »

Well, there is no problem with having a DB with the keyword as the index key; the same approach is the basis of a search engine.
So I think you can find some DB software (I can suggest something SQLite-based, like SQLite Database Browser, since SQLite is a very fast and small DB) and start inserting the keywords.
And if you define the keyword field as unique, you cannot insert duplicates.
Of course, if you create an index on the keyword to speed up searches, that index will take up additional disk space.
I think this is the most efficient approach.
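A minimal sketch of the unique-field idea, using Python's built-in sqlite3 module. The helper name add_words is an assumption; INSERT OR IGNORE is SQLite's way of silently skipping rows that would violate the UNIQUE constraint, and in SQLite that constraint is backed by an automatically created index.

```python
import sqlite3


def add_words(conn, words):
    """Insert words, skipping any that already exist; return how many were new."""
    conn.execute("CREATE TABLE IF NOT EXISTS wordlist (word VARCHAR UNIQUE)")
    before = conn.execute("SELECT COUNT(*) FROM wordlist").fetchone()[0]
    # OR IGNORE drops rows that would break the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO wordlist (word) VALUES (?)", ((w,) for w in words)
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM wordlist").fetchone()[0]
    return after - before


# Assumption: an in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
first = add_words(conn, ["ab", "aba", "abaca", "aba"])  # "aba" appears twice
second = add_words(conn, ["aba", "abaci"])              # only "abaci" is new
```

Importing several wordlists one after another this way leaves a single table with no duplicates.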
Otherwise, you can use your approach, maybe splitting the words into several files named after the first letter of the keyword (A.txt for all the words starting with A, B.txt for words starting with B, etc.).
And if you think it could be useful, I already have a little tool to compare and merge two text files with no duplicates. Of course, I haven't tried it on a very BIG text file.
However, I strongly suggest the first solution; it's more practical, I think.
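For comparison, merging two wordlists without duplicates can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool mentioned above, and it assumes all unique words fit in memory, so it would not scale to very big files.

```python
def merge_unique(lines_a, lines_b):
    """Merge two iterables of lines, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for source in (lines_a, lines_b):
        for line in source:
            word = line.strip()
            if word and word not in seen:  # skip blank lines and repeats
                seen.add(word)
                merged.append(word)
    return merged


merged = merge_unique(["ab\n", "aba\n", "abaca\n"], ["aba\n", "abaci\n", "ab\n"])
```

Open file objects can be passed in place of the lists, since iterating over a text file yields one line at a time.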
