However, I think the algorithm for keyword extraction would not even have to be that complicated.
I'm a bit dubious on that one. If it's a client-supplied list, no problem. But if it's based on frequency, that's a horse of a different color. The critical keyword(s) for a given text might not be as high in frequency as many other words.
For instance, you're reading an article on "The Indigent Population of Certain Polynesian Islands". "Indigent" would certainly be a keyword, but might never be mentioned save in the title. References within the article might be "native", "poor", "disadvantaged" and the like. So you'd need a pretty strong thesaurus algorithm to catch the proper keywords, ones relevant to the thrust of the article, for an appropriate summary.
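To make the frequency problem concrete, here's a rough Python sketch. The article body, the stopword list, and the top_keywords helper are all made up for illustration; it's just a naive word counter, not anybody's real extraction algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, nowhere near complete).
STOPWORDS = {"the", "of", "a", "and", "in", "on", "are", "is", "to",
             "many", "often", "for"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

title = "The Indigent Population of Certain Polynesian Islands"

# Invented body text: it talks about the topic without using the title word.
body = (
    "Many native families on the islands are poor. "
    "Poor harvests leave native villages disadvantaged. "
    "The poor and disadvantaged native residents often leave the islands."
)

if __name__ == "__main__":
    print(top_keywords(body, 2))
    print("indigent" in top_keywords(body, 5))
```

On this sample the counter surfaces "native" and "poor", and "indigent" never shows up at all, because it appears only in the title.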
It's fairly easy to condense an article, not so easy to have the condensed version contain the meat of the original article. We had a team working on that for some documentation several years ago. Five members, as I recall, and they had difficulty reaching consensus on distillations even when they were involved in open discussion of the material to be condensed. I'd hate to have to try to write that program, alone or in a dev team. 'Twould be an interpretational nightmare, methinks.
Unlike with Web page keywords, frequency in an article is not necessarily indicative of importance or relevance. I suppose, if the title were true to the purpose of the article, you could extract keywords from it, then do a thesaurus lookup for relevant words, but even that would be a nasty job, since many thesaurus entries would not be meaningful to the article's purpose.
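Here's roughly what that title-plus-thesaurus idea might look like in Python. Big caveat: THESAURUS below is a hand-typed stand-in I invented for the example, not a real thesaurus, and the body text and function names are hypothetical too. The sketch keeps only the synonyms that actually occur in the article, which is one crude way to throw out entries that don't fit.

```python
import re

# Illustrative stopword list (assumption); "certain" is dropped as filler.
STOPWORDS = {"the", "of", "a", "and", "in", "on", "certain"}

# Hypothetical, hand-written stand-in for a real thesaurus lookup.
THESAURUS = {
    "indigent": {"poor", "disadvantaged", "needy", "bankrupt", "beggarly"},
    "population": {"residents", "inhabitants", "populace"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def title_keywords(title):
    """Treat every non-stopword in the title as a candidate keyword."""
    return [w for w in tokenize(title) if w not in STOPWORDS]

def related_terms_in_body(title, body):
    """Map each title keyword to its thesaurus entries found in the body."""
    body_words = set(tokenize(body))
    return {
        keyword: sorted(THESAURUS.get(keyword, set()) & body_words)
        for keyword in title_keywords(title)
    }

title = "The Indigent Population of Certain Polynesian Islands"
body = (
    "Many native families on the islands are poor. "
    "Poor harvests leave native villages disadvantaged. "
    "The poor and disadvantaged native residents often leave the islands."
)

if __name__ == "__main__":
    for keyword, hits in related_terms_in_body(title, body).items():
        print(keyword, hits)
```

Even on this toy input, "bankrupt" and "beggarly" come back from the lookup and have to be discarded, and "native" is missed entirely because no thesaurus would list it under "indigent". That's the interpretational part a program can't easily do.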