You ask about a Moby Thesaurus III...
Back in the day, I did most of the work on a Macintosh IIfx (68030 processor at 40 MHz, 8 MB RAM, 80 MB hard drive; even with my 60% off employee discount, the machine cost about $3,000). Most of my custom software to perform set operations on large (for the day) word data sets was written in C using the Macintosh Programmer's Workshop shell, while I enjoyed the view from Pacific Grove, California and took care of my two young boys. Since this was the early '90s, the Internet was pretty much limited to universities and larger companies; individuals had to settle for bulletin board systems that were rarely linked to one another.
Now, of course, we have the WWW and companies such as Google that have billions of dollars and immense server farms ready to do this kind of research:
http://googleresearc...e-belong-to-you.html
To lobby for a particular data set, you might contact Peter Norvig at Google ("Dr Peter Norvig has been at Google Inc since 2001 as the Director of Machine Learning, Search Quality, and Research."): pnorvig (at) google.com. First of all, thank him for the above research that has already been donated to the public domain, and then ask him to publish fresh sets of raw language data to the public domain. I would be happy to contract with him to produce this work, if I could use Google resources to assist me.
I spoke with Peter a few years ago when he was using Moby data to construct "the world's longest anagram." I believe he shares a mission to better the resources available to all developers, especially the individual contributor from whom all innovation originates.
My model was originally to license the data to institutions such as MIT, NSA, Getty, Lotus, Apple... and, once they had paid for my time in creating the datasets, to put the entire work into the public domain. This I did. I preferred the public domain to an FSF license because I really do not want to impose any limits upon re-use.
If a person wants to profit from re-use of the Moby data, that is perfectly fine by me. My goal was to give developers a "leg up," not to impose some idiosyncratic view about capitalism or freedom. The existence of new works predicated upon Moby makes us all more free to think, to express ourselves, and by subsequent lucid dreaming, to work toward a civilization. My personal thanks come from enjoying the works by Anderson and others, which in turn enrich my life and the lives of others who like to use English. Maybe it will make a crucial difference to someone in expressing a significant idea that in turn helps cure malaria, measles, or HIV/AIDS, or in framing a political conjecture that will save a hundred million lives.
So the shorter answer is that I am not personally planning to release a new data set, but I hope that my years of tedious work can be used to bully a company such as Google into a proportionate contribution.
Thank you, Anderson, for Mobysaurus.
Grady