I'm trying to find a pragmatic solution to the following task:
Find new pages/sites/posts around a set of topics in as automated a way as possible, in an ongoing way.
By new I mean either newly arrived/posted, or never seen before because it sits too "deep" in the "dark" web, i.e. we haven't gotten to the 21,000th page on Google where it appears...
So in a way it would be a tool that
- is able to trawl a bunch of directories, blogs, feeds
- is also able to run a whole list of searches in google/yahoo/ask etc. or some aggregator sites, possibly starting at page 100 to avoid all the noisy stuff at the top
- gather all URLs referenced in those results
- then it should be able to discard any site/URL previously seen (the 'greylist') so as to produce a list of new things to review
- update the greylist
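The gather → filter-against-greylist → update-greylist steps above could be sketched roughly like this (a minimal sketch; the plain-text greylist file format and function names are my own assumptions, not part of any existing tool):

```python
# Sketch of the greylist step: keep only URLs we haven't seen before,
# then remember them so they don't show up again next run.

def filter_new_urls(candidates, greylist):
    """Return candidate URLs not in the greylist, and add them to it."""
    new = [u for u in candidates if u not in greylist]
    greylist.update(new)
    return new

def load_greylist(path):
    """Load the greylist from a plain text file, one URL per line."""
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()  # first run: nothing seen yet

def save_greylist(path, greylist):
    """Persist the greylist back to disk for the next run."""
    with open(path, "w") as f:
        f.write("\n".join(sorted(greylist)))
```

Each scheduled run would then just load the greylist, feed it every URL harvested from the directories/feeds/searches, review whatever `filter_new_urls` returns, and save the greylist again.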
In a way this combines features from a search tool, an update-watcher tool, a web ferret, a URL catcher, etc.
If I had to write it I would probably write a Perl application with a topic manager, a web crawler, possibly even a plugged-in Bayesian toy to help ignore spammy sites, and a results list with preview... But I don't have time, and I'm not sure this is something I can just commission and hope to get something workable (= better than manual methods) at a reasonable cost.
So it got me thinking about how some of the benefits of this tool might be achieved by combining existing tools, say by pointing WebSite-Watcher at a set of search-engine results URLs, with a different search term hard-coded in each. Now this doesn't quite work as-is, since WSW would highlight too many other changes in page content rather than purely new URLs, but there just might be ways...
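For the "hard-coded search URLs starting deep in the results" part, generating one watcher-ready URL per term is simple enough. A rough sketch (assuming the engine accepts a `start` results-offset parameter in its query string, which may vary by engine and change over time):

```python
# Build deep search-results URLs, one per search term, suitable for
# pasting into a page-watcher tool. The "start" offset convention is
# an assumption about the engine's URL scheme.
from urllib.parse import urlencode

def search_urls(terms, start_page=100, per_page=10):
    """One results URL per term, starting at the given 1-based page."""
    offset = (start_page - 1) * per_page
    return [
        "https://www.google.com/search?" + urlencode({"q": t, "start": offset})
        for t in terms
    ]
```

The watcher would then only ever see results from page 100 onward, skipping the noisy stuff at the top.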
So it's not really a coding snack, it's more of a software jigsaw puzzle.