
IDEA: Compare dictionary (text) files and remove duplicate entries


cmpm:
free
http://www.prestosoft.com/edp_examdiff.asp

not free
http://www.ultraedit.com/products/ultracompare.html

http://www.scootersoftware.com/compare-text-files.php

Those are the only programs in my bookmarks that might help.

MilesAhead:
Just for chuckles, I started a thread on the W7 forum. See if anyone adds a useful suggestion there:

http://www.sevenforums.com/software/157723-interesting-db-dictionary-problem.html#post1353124

bhuiraj:
Thank you for all the help, guys :). I will continue to look for a solution and follow the W7 thread. Please continue posting if you have any thoughts or new ideas.

DK2IT:
Well, there is no problem with having a DB that uses the keyword as the index key; the same approach is the basis of a search engine.
So I think you can find some DB software (I suggest something SQLite-based, since SQLite is a very fast and small DB; SQLite Database Browser, for example) and start inserting the keywords.
And if you declare the keyword field as unique, you cannot insert duplicates.
Of course, if you create an index on the keyword to speed up searches, that index will take up additional disk space.
I think this is the most efficient system.
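
For example, a minimal Python sketch of that idea using the standard sqlite3 module (the file names keywords.db and words.txt are just placeholders; INSERT OR IGNORE is what makes the unique keyword column drop duplicates silently):

import sqlite3

# Placeholder file names; words.txt stands in for one of the big word lists.
conn = sqlite3.connect("keywords.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (keyword TEXT PRIMARY KEY) WITHOUT ROWID")

with conn, open("words.txt", encoding="utf-8") as src:
    # INSERT OR IGNORE silently skips any row that would violate the
    # primary-key uniqueness, so duplicates never get stored.
    conn.executemany(
        "INSERT OR IGNORE INTO words (keyword) VALUES (?)",
        ((line.strip(),) for line in src if line.strip()),
    )

conn.close()

WITHOUT ROWID stores the keywords directly in the primary-key B-tree, so the word list is not stored twice; that partly answers the extra-disk-space concern above.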
Otherwise, you can use your own system, maybe splitting the words into several files named after the starting letter of the keyword (A.txt for all the words starting with A, B.txt for words starting with B, etc.), as in the sketch below.
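
In Python, that splitting step might look roughly like this (again, words.txt and the buckets directory are placeholder names, and words not starting with a letter go into one catch-all file):

import os

# Rough sketch of the split-by-first-letter idea; names are placeholders.
os.makedirs("buckets", exist_ok=True)
handles = {}

with open("words.txt", encoding="utf-8") as src:
    for line in src:
        word = line.strip()
        if not word:
            continue
        first = word[0].upper()
        if not first.isalpha():
            first = "_"  # lump digits/punctuation into one catch-all file
        if first not in handles:
            # Append mode, so several input files can be streamed in one after another.
            handles[first] = open(os.path.join("buckets", first + ".txt"), "a", encoding="utf-8")
        handles[first].write(word + "\n")

for fh in handles.values():
    fh.close()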
And if you think it could be useful, I already have a little tool to compare and merge two text files without duplicates. Of course, I haven't tried it on very BIG text files.
However, I strongly suggest the first solution; it's more practical, I think.

MilesAhead:
The trouble with using a relational database in this case is that the key is the only data. You're not saving anything or gaining any efficiency. If I have three paragraphs of text stored under the key "ProfIrwinCoreySpeach", then I can use the key to get the data. With the dictionary, there is no other data. There's nothing to optimize.

You could split the files according to first character, as you say. But the only good way would be to build as you go along; having 33 GB of spaghetti data to unravel at the start makes it unwieldy. You need an engine that builds the files from scratch as you feed it words. Even then, once the letter files get too big to hold in RAM, it's going to be slow, because you'll have to either create space for insertions according to a binary search, or try to use some fixed average word length, which will waste space.
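
For what it's worth, a rough Python sketch of that kind of build-as-you-go engine, under the assumption that each per-letter set still fits in RAM (which is exactly where it breaks down at 33 GB; the words.txt input name is hypothetical):

from collections import defaultdict

# Build-as-you-go sketch: stream words in, keep one in-memory set per
# first letter, and write each deduplicated letter file only at the end.
# This works only while every per-letter set fits in RAM.
buckets = defaultdict(set)

with open("words.txt", encoding="utf-8") as src:
    for line in src:
        word = line.strip()
        if word:
            buckets[word[0].upper()].add(word)

# Sorted output keeps each letter file ready for binary search or merging.
for letter, words in buckets.items():
    with open(letter + ".txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(words)) + "\n")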

I think this is a case with no easy solution.
