Author Topic: Idea request: "Site discovery tool"  (Read 9867 times)

iphigenie

  • Supporting Member
  • Joined in 2006
  • Posts: 1,170
Idea request: "Site discovery tool"
« on: November 05, 2007, 08:03 AM »
I'm trying to find a pragmatic solution to the following task:

Find new pages/sites/posts around a set of topics in as automated a way as possible, in an ongoing way.
By new I mean either newly arrived/posted, or not seen before because it is too "deep" in the "dark" web, so we haven't gotten to the 21000th page on Google where it sits...

So in a way it would be a tool that
- is able to trawl a bunch of directories, blogs, feeds
- is also able to run a whole list of searches in Google/Yahoo/Ask etc. or some aggregator sites, possibly starting at page 100 to avoid all the noisy stuff at the top
- gathers all the URLs referenced in these
- is then able to discard any site/URL previously seen (a 'greylist') so as to produce a list of new things to review
- updates the greylist
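
The last two steps above (discard anything already on the greylist, then update it) could be sketched roughly as follows. This is only a minimal illustration in Python, not an existing tool: the function names, the JSON file format, and the normalization rule (lowercase host without "www." plus the path) are all assumptions made for the example.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable key: lowercase host (minus 'www.') plus path.

    The normalization rule is an assumption for this sketch; query strings
    and fragments are deliberately ignored.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def load_greylist(path: Path) -> set[str]:
    """Read previously seen URL keys; a missing file means nothing seen yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_greylist(path: Path, seen: set[str]) -> None:
    """Persist the greylist as a sorted JSON array (hypothetical format)."""
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def filter_new(urls: list[str], seen: set[str]) -> list[str]:
    """Return URLs whose key is not yet in `seen`, adding each one to `seen`."""
    fresh = []
    for url in urls:
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

A run would load the greylist, pass every gathered URL through `filter_new`, review what comes back, and save the greylist again. Keying on the whole host instead of host plus path would hide entire sites rather than single pages.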

In a way this combines features from a search tool, an update watcher tool, a web ferret, a URL catcher etc.

If I had to write it I would probably write a Perl application with a topic manager, a web crawler, possibly even plug in a Bayesian toy to help ignore spammy sites, a results list with preview... But I don't have time, and I am not sure this is something I can just commission and hope to get something workable (= better than using manual methods) at a reasonable cost.

So it got me thinking about how some of the benefits of this tool might be achieved by combining existing tools, say by feeding WebSite-Watcher a set of search engine results URLs, each with different search terms hard-coded. Now this doesn't quite work, as WSW would highlight too many other changes in content rather than purely new stuff, but there just might be ways...

So it's not really a coding snack, it's more of a software jigsaw puzzle.

Veign

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 993
Re: Idea request: "Site discovery tool"
« Reply #1 on: November 05, 2007, 08:17 AM »
What about using something like Google Alerts:
http://www.google.com/alerts

I have hundreds of alerts I monitor. Mainly my stuff, but also some competitor and client stuff too.

iphigenie

Re: Idea request: "Site discovery tool"
« Reply #2 on: November 05, 2007, 08:25 AM »
I can't remember why I didn't like it the last time; it could have just been my mistrust of Google, or it didn't go "deep" enough. But I will try it again since you say it can work with hundreds of them  8).

Thanks :)

justice

  • Supporting Member
  • Joined in 2006
  • Posts: 1,898
Re: Idea request: "Site discovery tool"
« Reply #3 on: November 05, 2007, 08:42 AM »
What about http://technorati.com/? Search for a subject and you see the updates regarding this subject on your screen; there's even a watcher. Note, however, that most if not all of the results are from blogs, so that might not be suitable for your needs.

iphigenie

Re: Idea request: "Site discovery tool"
« Reply #4 on: November 05, 2007, 08:46 AM »
Technorati has quite a niche focus, but I see what you mean.

We already use a lot of feed aggregators and the like. Our guy already checks a lot of these, and hangs out in forums etc. But I am trying to make his job easier, as it can easily take hours out of the day just to find a few tidbits of missing information.

I guess what I want is something to help us find the bits that don't make it onto these. The pages that are too specialist, or have nothing funny or trendy, but have real useful information on our topic. Pages that are at the bottom of Technorati or similar sites, sites that are on page 2000 on Google, sites that haven't yet made it, things hidden on information-rich pages on company websites.


icekin

  • Supporting Member
  • Joined in 2007
  • Posts: 264
Re: Idea request: "Site discovery tool"
« Reply #5 on: November 05, 2007, 09:20 AM »
Some time back, I saw this site, Searchbots (http://www.searchbots.net/). The idea is that you list some topics and build a searchbot which goes off and searches around for content. Over time, it gathers more content and then aggregates it. Every day, it returns fresh content.

Pubsub (http://www.pubsub.com) was another search engine that could return results based on recently indexed news content. For finding those high-quality yet hard-to-locate sites, I suggest directories over search engines. Complete Planet (http://www.completeplanet.com) has a listing of several specialist directories on the internet.

I wrote an article about searching the internet about 2 years ago, but I failed to maintain my site and it was taken down. You can find its last version on the Wayback Machine: http://web.archive.o...rg/look_at_searching

Some of the stuff is a bit outdated, but much of it can still be used.

iphigenie

Re: Idea request: "Site discovery tool"
« Reply #6 on: November 05, 2007, 10:57 AM »
Thanks icekin, I can see you spent a lot of time thinking about these things  :Thmbsup:

choicefresh

  • Participant
  • Joined in 2007
  • Posts: 20
Re: Idea request: "Site discovery tool"
« Reply #7 on: November 19, 2007, 07:11 PM »
Have you heard of StumbleUpon??? That sounds like exactly what you're looking for...

iphigenie

Re: Idea request: "Site discovery tool"
« Reply #8 on: November 21, 2007, 05:24 PM »
Google Alerts certainly cannot do the job - they illustrate everything that is wrong with Google's approach when you go for more topical/targeted information.

I have heard of StumbleUpon, but... I want the bits that aren't popular enough, and aren't in topics "light" enough, to make it through filters like StumbleUpon... We're talking professional topics, and not web dev or something popular like that, and StumbleUpon stumbles a bit there...

Or in a way I am looking for a way to automate what we are doing now, which equates to trawling numerous blogs and forums, subscribing to newsfeeds, blogrolls and press release services, doing regular deep searches, watching 100 companies... to find tidbits of information, pages, white papers etc. which are in the dark/deep web. The opposite of what Digg/Technorati/StumbleUpon do, which is to try to float some things to the surface because they are fun/cool/intriguing to a large number of people.

Some bits I know how to improve, using watchers, feed readers, search tools - but especially when it comes to the search tool I would like something more clever than the usual search. Something that remembers what I have already found before and doesn't show it again unless it has changed a lot. Something that can blacklist things and ignore things and make it easier for me to see stuff that I haven't seen before...
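
To make that concrete, here is a rough sketch of the "don't show it again unless it has changed a lot" rule plus a blacklist. It is only an illustration under stated assumptions: the function names, the in-memory `seen` dictionary, and the 0.5 similarity cutoff are invented for the example, and Python's standard `difflib.SequenceMatcher` stands in for whatever change measure a real tool would use.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Assumed cutoff: below this similarity a page counts as "changed a lot".
CHANGE_THRESHOLD = 0.5


def changed_a_lot(old_text: str, new_text: str,
                  threshold: float = CHANGE_THRESHOLD) -> bool:
    """True when the two texts are less similar than the threshold."""
    similarity = SequenceMatcher(None, old_text, new_text).ratio()
    return similarity < threshold


def should_show(url: str, text: str, seen: dict[str, str],
                blacklist: set[str]) -> bool:
    """Decide whether a fetched page belongs in the review list.

    `seen` maps each URL to the text it had when last shown; it is only
    updated when the page is shown, so small edits accumulate until they
    cross the threshold.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host in blacklist:
        return False  # blacklisted hosts are never shown
    previous = seen.get(url)
    if previous is None or changed_a_lot(previous, text):
        seen[url] = text
        return True
    return False
```

Comparing whole pages this way gets slow on long texts, so a real tool would probably compare extracted main content or a cheaper fingerprint instead.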

I guess it's an "I wish I still had time to code" moment :S
« Last Edit: November 21, 2007, 05:28 PM by iphigenie »

app103

  • That scary taskbar girl
  • Global Moderator
  • Joined in 2006
  • Posts: 5,884
Re: Idea request: "Site discovery tool"
« Reply #9 on: November 22, 2007, 08:20 PM »
You might want to give Copernic Agent a try. The Professional edition uses multiple search engines to find results, specialized search engines based on the type of info you are looking for (you can even do patent and FTP searches with it), tracks results, etc. This was THE way of finding info back when there wasn't a Google to use.

You could combine it with something like Newzie, which has features for alerting on certain keywords in RSS feeds, which it can do at every feed update. It can also act as a site watcher, notifying you not only of changes on a site, but I believe also of changes that trigger the same keyword alerts you have set up for feeds. (Newzie is pretty powerful!)
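
For anyone wanting to script the keyword-alert idea themselves, a minimal sketch follows. It is not how Newzie works internally (that isn't documented here); it just scans an RSS 2.0 document, already downloaded as a string, for items whose title or description mentions any keyword. The function name and the case-insensitive substring matching are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def matching_items(rss_xml: str, keywords: list[str]) -> list[tuple[str, str]]:
    """Return (title, link) for each RSS 2.0 <item> mentioning any keyword.

    Matching is a plain case-insensitive substring test on the item's
    title and description.
    """
    wanted = [k.lower() for k in keywords]
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        summary = item.findtext("description", default="")
        haystack = f"{title} {summary}".lower()
        if any(k in haystack for k in wanted):
            hits.append((title, link))
    return hits
```

Running this at every feed update and passing the resulting links through a "seen before" filter would give alerts limited to items not yet reviewed.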

iphigenie

Re: Idea request: "Site discovery tool"
« Reply #10 on: November 26, 2007, 10:35 AM »
What I really, really dream of is Copernic Agent, but with the extra feature of "hide sites I have already seen".