Author Topic: links collector (Read 9591 times)

kalos · « **on:** July 08, 2007, 12:20 AM »

hello

1)
is there a way to grab and store (in a txt file) all the "links" or "urls in the text" of all the webpages I visit that match a specific mask, e.g. urls like www.*.com/*.pdf?

the program must scan the text and links of all the webpages I visit and, if it finds a url matching the above mask, it should store it (in a text file)

2)
I would like a program that will store (in a text file) the urls of the webpages I visit that match a specific mask, e.g. www.google.com/*

thanks!
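
For readers following along, here is a minimal Python sketch of the mask matching asked for above. It is not a browser integration: it only shows how text or a visited url could be checked against a mask such as www.*.com/*.pdf once something else has handed the text over. The function names, the rough url regex, and the output file path are all assumptions made for illustration.

```python
import re
from fnmatch import fnmatchcase

# Rough pattern for "www." urls in plain text, with or without a scheme.
# This is an assumption for the sketch, not a full url grammar.
URL_PATTERN = re.compile(r"(?:https?://)?www\.[^\s\"'<>]+", re.IGNORECASE)


def strip_scheme(url):
    """Drop a leading http:// or https:// so masks can start at 'www.'."""
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE)


def find_matching_urls(text, mask):
    """Return the urls found in text that match a wildcard mask, in order, without duplicates."""
    matches = []
    for candidate in URL_PATTERN.findall(text):
        # Trim punctuation that usually belongs to the sentence, not the url.
        candidate = candidate.rstrip(".,;:)!?")
        if fnmatchcase(strip_scheme(candidate).lower(), mask.lower()):
            if candidate not in matches:
                matches.append(candidate)
    return matches


def append_urls(path, urls):
    """Append one url per line to a text file (hypothetical output location)."""
    with open(path, "a", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

The same function covers both requests: pass the page text with the mask www.*.com/*.pdf for 1), or pass just the visited address with www.google.com/* for 2). Note that fnmatch-style masks treat ? and [ as wildcards too, so a mask containing those characters would need escaping.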

jgpaiva · « **Reply #1 on:** July 08, 2007, 05:43 AM »

I know this'd be pretty easy: the only problem is getting the links themselves. The parsing and storing part is pretty trivial.

I don't know how to get that info, though. If anyone knows how, please post here.

steeladept · « **Reply #2 on:** July 08, 2007, 10:27 AM »

If you are only asking about links, why not parse out only the <a> tags? The anchor tags have to exist with the correct URL for it to link to another page, so that could quickly and easily filter out everything else. The only problem that would need to be figured out is if someone posted it in plain text without a link.

The logical expression for the filter would be: Find <a, then find href=, then return the url. Match this to the mask and place it in xxxx file.

Unfortunately I don't know any coding for that beyond the code for the links in the html, but I hope that helps.
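
The filter described above (find <a, then find href=, then match the url to the mask) could be sketched in Python roughly like this. It uses the standard library's html.parser; the class and function names are made up for the example, and it assumes the page's html has already been obtained somehow.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def links_matching(html, mask):
    """Return the anchor hrefs in html that match a wildcard mask (case-insensitive)."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.hrefs
            if fnmatchcase(href.lower(), mask.lower())]
```

A mask such as *www.*.com/*.pdf (with a leading * to allow for http://) would pick out linked PDFs. As noted, a url posted as plain text without an <a> tag is not found by this approach.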

jgpaiva · « **Reply #3 on:** July 08, 2007, 11:57 AM »

steeladept: I think that kalos is referring to collecting the links as he browses the web, so the links would have to be collected directly from the web browser or something like that.
At least, that's how I understand it.

steeladept · « **Reply #4 on:** July 08, 2007, 03:05 PM »

I see what you mean. Still, isn't there a way to choose href from link "on click" or something like that?

kalos · « **Reply #5 on:** July 08, 2007, 05:05 PM »

mm, I can't imagine what event would trigger the scanning and grabbing

I suppose some kind of browser integration, or a way to see what webpages I visit from within the browser

steve_rb · « **Reply #6 on:** July 09, 2007, 07:34 AM »

URL Snooper can nicely grab and list all links on the pages you visit. Just click on sniff, go to your browser, and start browsing. You can even tell URL Snooper to filter the listed links by any text you want. This is great software, but unfortunately I wanted it to go through all pages and save links without me surfing those pages. Pity it can't do this.

jgpaiva · « **Reply #7 on:** July 09, 2007, 08:43 AM »

Very good point, steve!
I haven't tried it, but that method does seem to work

kalos · « **Reply #8 on:** July 10, 2007, 09:35 AM »

thanks steve

however I doubt if sniffing is accurate and reliable

as for what you need, a web spider/crawler would do that, but I don't know of a good one

kalos · « **Reply #9 on:** July 13, 2007, 05:22 PM »

I get this error

mouser · « **Reply #10 on:** July 13, 2007, 07:07 PM »

someone else was posting about this error before too. what is causing this mystery npptools.dll error?

mouser · « **Reply #11 on:** July 13, 2007, 07:11 PM »

saw this post recently:
http://jwsecure.com/dan/2006/12/

kalos,
Can you try this:

uninstall winpcap (not url snooper) from add/remove programs control panel
reboot
install winpcap 4.1 beta: http://www.winpcap.o...WinPcap_4_1_beta.exe
reboot
see if it works now.

kalos · « **Reply #12 on:** October 30, 2007, 08:39 AM »

mmm

URL Snooper is a great program, but not exactly the way to go for this situation, for three reasons:

1) it doesn't integrate within the browser
2) it doesn't autosave files, text, links
3) it sniffs the network, trying to get "hidden" files, while I only need to see what is seen within the browser, nothing more

I was looking for a firefox extension or an opera javascript that will see the text and links of each webpage I visit and it will save specific links/text/files

there is no need to sniff the network in order to catch "hidden" urls etc, it's overkill

the problem is that most of programs that attempt to do what I need (offline browsers, etc) require you to enter a starting web address, then specify retrieval options and then let the program do the job

but what I want is a browser-integrated solution that does this within my browser "as I browse"

an auto-bookmarker, an auto-file-saver, an auto-text-saver that will save info automatically as I browse the net

thanks

kalos · « **Reply #13 on:** November 17, 2007, 06:20 PM »

first, I need something that will "monitor" every webpage I visit, as I browse the net
this monitoring has to be very accurate of course, which means it must not miss any webpage, even webpages that are only partially loaded etc

by "monitoring webpages" I mean to grab the text, links and files of every webpage I visit
by "the text of the webpage", I mean the text that is highlighted/selected when we press ctrl+A in a webpage (including any other "hidden" text, etc)
by "the links of the webpage", I mean the links that are grabbed when we hit ctrl+alt+L in Opera or any other method that shows all the links of the webpage (including any "hidden" links, javascript links, etc)
by "the files of the webpages", I mean the files that are included in the folder that is created when we save a webpage, which creates an html file and a folder (and any other hidden files, embedded files, etc)

as far as I know (and if you know something else, please inform me), the available methods that can monitor web browser traffic are these:
javascript can monitor webpages as I browse the net (opera, for example, has this javascript function: document.addEventListener('DOMContentLoaded', function() { ... }), which does things when the webpage is loaded)
an internet connection sniffer can monitor webpages as I browse the net, since it can sniff urls
a web proxy can monitor webpages as I browse the net, since it works as a cache proxy

then I need to apply filters to specify which of the text, links and files are useful and then we need to save the filtered information

any help would be much appreciated

thank you

Ralf Maximus · « **Reply #14 on:** November 17, 2007, 08:48 PM »

Does it need to be real-time?

If you're using IE6 or 7, the browser's cache is just a collection of files that can be accessed via the file system. I imagine any file-search utility that does regular expressions (FileLocator Pro?) could suss out the patterns you've described. I know for a fact that UltraEdit's file-search feature will do this.

This is not real-time scanning, but you could kick off such a search after your browsing session is complete.
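
As a rough illustration of that after-the-session approach, here is a small Python sketch that walks a folder of cached files and pulls out urls matching a regular expression. The regex (an approximation of the www.*.com/*.pdf mask from the first post) and the function name are assumptions, and the cache folder path is left for the reader to supply since it varies by system.

```python
import re
from pathlib import Path

# Approximates the mask www.*.com/*.pdf as a regex over raw bytes,
# since cached files may not be valid text in any single encoding.
PDF_URL = re.compile(
    rb"(?:https?://)?www\.[^\s\"'<>/]+\.com/[^\s\"'<>]*?\.pdf\b",
    re.IGNORECASE,
)


def scan_cache(cache_dir, pattern=PDF_URL):
    """Return the sorted, de-duplicated urls found in any file under cache_dir."""
    found = set()
    for path in Path(cache_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            data = path.read_bytes()
        except OSError:
            # Skip files that are locked or unreadable instead of aborting the scan.
            continue
        for match in pattern.findall(data):
            found.add(match.decode("ascii", errors="replace"))
    return sorted(found)
```

Calling scan_cache on the cache folder after a browsing session returns the matching urls, which can then be written to a text file. It only sees what the browser actually kept in its cache, so pages that were never cached will be missed.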

rjbull · « **Reply #15 on:** November 18, 2007, 11:35 AM »

I doubt it will satisfy kalos, but don't forget this:

VisitURL: Flexible, efficient, lightweight bookmark manager

VisitURL is not a fully-fledged bookmark manager. It is not a replacement for Netscape's bookmark file or Internet Explorer's Favorites. It does not organize bookmarks into categories.

VisitURL is designed to help maintain a handy (as in: at hand) list of URLs that you intend to visit. For instance, if a friend sends you an email recommending a particular URL, Visit is a good place to store the URL until you are ready to launch your browser and go surfing. If you copy the URL to clipboard, Visit will automatically intercept it and save to its database. If you copy several URLs at once, Visit will get all of them. (You may also add URLs manually or directly from an open browser window.) If you copy the URL with some text around it, Visit will optionally treat that text as a description for the URL you copied.

To access the bookmarked site, you can either view the HTML page that Visit creates in your browser, or click a toolbar button to launch the browser directly from Visit. There is no limit to how many URLs you may store, though the program is primarily designed to hold, view and edit a short, temporary list. Netscape and Explorer tend to consume so much system resources that it's not practical to keep them loaded at all times - this is where Visit comes in.
