Idea request: "Site discovery tool"

iphigenie:
I'm trying to find a pragmatic solution to the following task:

Find new pages/sites/posts around a set of topics in as automated a way as possible, in an ongoing way.
By new I mean either newly arrived/posted, or not seen before because it is too "deep" in the "dark" web, so we haven't gotten to the 21000th page on Google where it sits...

So in a way it would be a tool that
- is able to trawl a bunch of directories, blogs, feeds
- is also able to run a whole list of searches in google/yahoo/ask etc. or some aggregator sites, possibly starting at page 100 to avoid all the noisy stuff at the top
- gathers all the URLs referenced in these
- is then able to discard any site/URL previously seen (tracked in a 'greylist'), so as to produce a list of new things to review
- updates the greylist
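
The discard-and-update steps at the end of that list could be sketched roughly as follows. This is only an illustration in Python, not the tool being asked for: the names `normalize` and `filter_new` are made up, the greylist is assumed to be a plain in-memory set, and the decision to ignore the scheme, a leading "www." and the query string when comparing URLs is an assumption that may be too aggressive for some sites.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparison key: lowercase host without 'www.' plus path.

    Assumption: scheme, query string and fragment are ignored, so
    http/https and trailing-slash variants count as the same page.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path


def filter_new(candidates, greylist):
    """Return the URLs whose key is not in the greylist, and record them.

    `greylist` is a set of normalized keys; it is updated in place so the
    same URL is never reported twice.
    """
    fresh = []
    for url in candidates:
        key = normalize(url)
        if key and key not in greylist:
            greylist.add(key)
            fresh.append(url)
    return fresh
```

In practice the set would be loaded from and saved to a file or small database between runs, so that each run only reports what earlier runs have not already shown.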

In a way this combines features from a search tool, an update watcher tool, a web ferret, a URL catcher, etc.

If I had to write it I would probably write a Perl application with a topic manager, a web crawler, possibly even a plugged-in Bayesian toy to help ignore spammy sites, and a results list with preview... But I don't have time, and I am not sure this is something I can just commission and hope to get something workable (= better than using manual methods) at a reasonable cost.

So it got me thinking about how some of the benefits of this tool might be achieved by combining existing tools, say by feeding WebSite-Watcher (WSW) a set of search engine results URLs, with a different search term hard-coded in each. Now this doesn't quite work, as WSW would highlight too many other changes in content rather than purely new stuff, but there just might be ways...
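
One possible way around the "too many other changes" problem is to compare only the links on a results page rather than its whole content. The following is a rough Python sketch of that idea using only the standard library; `LinkCollector`, `extract_links` and `new_links` are hypothetical names, fetching the page is left out, and this is not how WSW itself works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a href="..."> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and similar non-web links.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return every web link found in the HTML text, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def new_links(html, base_url, seen):
    """Return links not already in `seen` (a set), adding them to it."""
    fresh = []
    for url in extract_links(html, base_url):
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

With this approach a reworded snippet or a reshuffled results page produces nothing to review; only a link that has never appeared before does.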

So it's not really a coding snack, it's more of a software jigsaw puzzle.

Veign:
What about using something like Google Alerts:
http://www.google.com/alerts

I have hundreds of alerts I monitor. Mainly my stuff, but also some competitor and client stuff too.

iphigenie:
I can't remember why I didn't like it the last time; it could have just been my mistrust of Google, or that it didn't go "deep" enough. But I will try it again since you say it can work with hundreds of them  8).

Thanks :)

justice:
What about http://technorati.com/? Search for a subject and you see the updates regarding this subject on your screen; there's even a watcher. However, note that most if not all of the results are from blogs, so that might not be suitable for your needs.

iphigenie:
Technorati has quite a niche focus, but I see what you mean.

We already use a lot of feed aggregators and the like. Our guy already checks a lot of these, and hangs out in forums etc. But I am trying to make his job easier, as it can easily take hours out of the day just to find a few tidbits of missing information.

I guess what I want is something to help us find the bits that don't make it on to these. The pages that are too specialist, or have nothing funny or trendy, but have really useful information on our topic. Pages that are at the bottom of Technorati or similar sites, sites that are on page 2000 on Google, sites that haven't yet made it, or that are hidden on information-rich pages on company websites.
