IDEA: Compare dictionary (text) files and remove duplicate entries
DK2IT:
The trouble with using a relational database in this case is that the key is the only data. You're not saving anything or creating any efficiency. If I have three paragraphs of text with the key "ProfIrwinCoreySpeach", then I can use the key to get the data. With the dictionary, there is no other data. There's nothing to optimize.
--- End quote ---
What do you mean?
According to your example, you have this huge wordlist:
ab
aba
abaca
abacas
abaci
aback
abacus
abacuses
...
etc. etc.
that can be inserted into a database by creating a table with only one field, something like this:
CREATE TABLE wordlist (
    word VARCHAR
);
That's all.
Or maybe I have not understood your problem well :(
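To make that concrete, here is a rough sketch of how such an import could look with Python's standard sqlite3 module (the file names are made up for illustration); a UNIQUE column plus INSERT OR IGNORE is what actually removes the duplicate entries:
--- Code: ---
import sqlite3

# Hypothetical file names; the source list has one word per line.
SOURCE = "wordlist.txt"
DB = "wordlist.db"

con = sqlite3.connect(DB)
# Single-column table; UNIQUE makes SQLite reject repeated words.
con.execute("CREATE TABLE IF NOT EXISTS wordlist (word TEXT UNIQUE)")

with open(SOURCE, encoding="utf-8", errors="replace") as f:
    # INSERT OR IGNORE silently skips words already in the table,
    # so the table ends up duplicate-free.
    con.executemany(
        "INSERT OR IGNORE INTO wordlist (word) VALUES (?)",
        ((line.strip(),) for line in f if line.strip()),
    )
con.commit()

# Write the deduplicated list back out, sorted.
with open("wordlist_dedup.txt", "w", encoding="utf-8") as out:
    for (word,) in con.execute("SELECT word FROM wordlist ORDER BY word"):
        out.write(word + "\n")
con.close()
--- End code ---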
You could split the files according to first character as you say. But the only good way would be to build as you go along. Having 33 GB of spaghetti data to unravel at the start makes it unwieldy.
--- End quote ---
Of course, once you have the file well organized, the next time you add a new keyword you must insert it in the correct position. Or you can have a background process that keeps the file sorted for binary search. Or you can build an index file (for fast searches) over the unordered data.
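For the binary-search option, here is a rough sketch in plain Python (the path is hypothetical, and the file is assumed to be already sorted in byte order, one word per line); it works by seeking, so the file never has to fit in RAM:
--- Code: ---
import os

def word_in_sorted_file(path, target):
    """Binary search for a word in a large sorted one-word-per-line file."""
    target = target.strip().encode("utf-8")
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size              # lo always sits at the start of a line
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()          # skip the partial line we landed in
            raw = f.readline()
            word = raw.strip()
            if not raw or word > target:
                hi = mid              # the word can only start before mid
            elif word < target:
                lo = f.tell()         # the word can only start after this line
            else:
                return True
        # Corner case: the word may be the line starting exactly at lo.
        f.seek(lo)
        return f.readline().strip() == target

# e.g. word_in_sorted_file("sorted_wordlist.txt", "abacus")
--- End code ---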
MilesAhead:
OK, since you have the solution, please provide it to the OP.
I'd be interested in the performance of a DB that is both keys-only and many times larger than physical RAM.
Edit: In any case it seems a chore to try to massage the data in one go. Probably a better approach is to grab a chunk and sort it, then add entries as a background task (a rough sketch of that chunked approach follows below). How to go about it may depend on the tools the OP is currently using. Perhaps the big word list can be built while still being usable by the front-end tools.
I haven't been able to get any hits searching on this type of problem. Maybe it's not one RDBMS companies like to mention, as I imagine it's a worst-case scenario: all keys, nothing else to look up.
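One way that chunked approach could look, as a rough sketch in plain Python (file names and chunk size are made up): each chunk is sorted on its own and written to a temp file, then the sorted pieces are merged while repeated words are skipped.
--- Code: ---
import heapq
import itertools
import tempfile

def sort_and_dedup(src, dst, chunk_lines=1_000_000):
    """Sort a huge one-word-per-line file and drop duplicates, a chunk at a time."""
    temps = []
    with open(src, encoding="utf-8", errors="replace") as f:
        while True:
            # Read the next chunk, dropping blanks and duplicates within the chunk.
            chunk = sorted({w.strip() for w in itertools.islice(f, chunk_lines) if w.strip()})
            if not chunk:
                break
            t = tempfile.TemporaryFile("w+", encoding="utf-8")
            t.writelines(w + "\n" for w in chunk)
            t.seek(0)
            temps.append(t)

    with open(dst, "w", encoding="utf-8") as out:
        previous = None
        # heapq.merge lazily merges the already-sorted temp files.
        for line in heapq.merge(*temps):
            word = line.rstrip("\n")
            if word != previous:      # skip duplicates that span chunks
                out.write(word + "\n")
                previous = word
    for t in temps:
        t.close()

# e.g. sort_and_dedup("wordlist.txt", "wordlist_sorted_dedup.txt")
--- End code ---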
DK2IT:
Well, I've run a test.
Maybe SQLite is not so optimized for file size, but it's very fast at inserting and finding words.
The file size is about twice that of the TXT, and with an index added to speed up searches it is four times the size!
However, with today's large disks, and on an NTFS volume using file compression, the file size should not be a problem.
Here's my test: an SQLite file with about 5.6 million words; there may be duplicates, since I did a very quick import.
The search is very quick, even using the "slow" LIKE.
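DK2IT's database isn't attached here, but assuming a table like the one sketched earlier (a wordlist.db with a single word column), the lookups being described would look roughly like this:
--- Code: ---
import sqlite3

con = sqlite3.connect("wordlist.db")

# Exact lookup: served by the UNIQUE index, effectively instant.
hit = con.execute("SELECT 1 FROM wordlist WHERE word = ?", ("abacus",)).fetchone()
print("found" if hit else "missing")

# Prefix search with LIKE: SQLite may fall back to a full scan here,
# but scanning a few million short rows is still quick.
for (word,) in con.execute("SELECT word FROM wordlist WHERE word LIKE ?", ("abac%",)):
    print(word)

con.close()
--- End code ---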
bhuiraj:
How would I use/apply this? :)
MilesAhead:
I don't see anything you've suggested that I didn't already suggest in this post:
https://www.donationcoder.com/forum/index.php?topic=26416.msg245865#msg245865