
Author Topic: Page data harvest  (Read 8754 times)

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Page data harvest
« on: July 16, 2007, 09:52 PM »
Very frequently I would like the ability to harvest data or some information from a page.
Something like Snagit, but simpler, less expensive, and able to capture the raw HTML of a page...

I've attached a document that has pretty pictures and a description.


Simple Solution: Single Page Harvest

A function key responds by saving all the HTML for the current page to a file named

{student number}-{page name}.htm
Example: 214-Attendance.htm

Then later I'd use SED, GREP or AWK to find what I am looking for.
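
For illustration, here is a rough sketch in Python of what I mean - the URL, student number, and page name are just made-up placeholders:

Code: (Python)
# Sketch of the "single page harvest": fetch a page and save its raw
# HTML under the {student number}-{page name}.htm naming scheme.
# The URL and values below are hypothetical placeholders.
import urllib.request

def harvest_page(url, student_number, page_name):
    html = urllib.request.urlopen(url).read()
    filename = "{0}-{1}.htm".format(student_number, page_name)
    with open(filename, "wb") as f:
        f.write(html)
    return filename

# Would produce 214-Attendance.htm
harvest_page("http://example.com/attendance", 214, "Attendance")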

jgpaiva

  • Global Moderator
  • Joined in 2006
  • Posts: 4,727
Re: Page data harvest
« Reply #1 on: July 17, 2007, 05:46 AM »
I'm sorry, I'm not sure I understand what you're looking for, but why not just download the site with something like HTTrack?
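
For reference, HTTrack also has a command-line client that can mirror a site into a local folder - a rough sketch, where the URL and output folder are just placeholders:

Code: (Python)
# Sketch: mirror a site into a local folder with HTTrack's
# command-line client. Assumes httrack is installed and on the PATH;
# the URL and output folder are placeholders.
import subprocess

subprocess.run(["httrack", "http://example.com/", "-O", "./mirror"], check=True)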

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • Posts: 40,900
Re: Page data harvest
« Reply #2 on: July 17, 2007, 10:07 AM »
Browsers already have a function in the File menu for saving a page's HTML.. maybe you are only asking for a simple add-on or hotkey that will trigger that file-save function (and name the file a certain way)?
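
As a very rough sketch of that kind of hotkey - this uses the third-party keyboard and pyperclip Python packages, and everything in it (the hotkey, the student number, the page name) is just an assumed placeholder:

Code: (Python)
# Sketch of a "harvest this page" hotkey. Assumes the third-party
# packages keyboard and pyperclip (pip install keyboard pyperclip).
# Copy the page URL from the address bar, then press F8.
import urllib.request
import keyboard   # global hotkey handling (assumed dependency)
import pyperclip  # clipboard access (assumed dependency)

STUDENT_NUMBER = 214      # placeholder
PAGE_NAME = "Attendance"  # placeholder

def save_current_page():
    url = pyperclip.paste().strip()           # URL copied by the user
    html = urllib.request.urlopen(url).read()
    filename = "%s-%s.htm" % (STUDENT_NUMBER, PAGE_NAME)
    with open(filename, "wb") as f:
        f.write(html)

keyboard.add_hotkey("f8", save_current_page)
keyboard.wait()  # keep the script alive, waiting for the hotkey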

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Re: Page data harvest
« Reply #3 on: July 17, 2007, 02:45 PM »
jgpaiva - the real answer is that I'm ignorant, i.e. I lack experience and knowledge. And not being particularly bright doesn't help. Thank you very much for your suggestion. I'll look at it, and it may be just the right thing. Thanks again! :D


mouser - EXACTLY..... that would give me the minimal functionality I need. Any ideas? Or can you point me to a place where I can either have it done or learn how to do it myself?

tinjaw

  • Supporting Member
  • Joined in 2006
  • Posts: 1,927
Re: Page data harvest
« Reply #4 on: July 17, 2007, 03:03 PM »
If you are advanced enough to write "Then later I'd use SED, GREP or AWK to find what I am looking for," I suggest you look at WGET. It can grab what you need and save it locally.
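
Something along these lines, for example - wget's -q (quiet) and -O (output file) flags are standard, and the URL and search word below are just placeholders:

Code: (Python)
# Sketch: grab one page with wget, then do a grep-style pass over it.
import subprocess

url = "http://example.com/attendance"  # placeholder
outfile = "214-Attendance.htm"

# wget -q (quiet) -O <file> <url> saves the raw HTML locally.
subprocess.run(["wget", "-q", "-O", outfile, url], check=True)

# grep-like step: print every line containing the word "absent"
with open(outfile, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "absent" in line:
            print(line.rstrip())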

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Re: Page data harvest
« Reply #5 on: July 18, 2007, 06:19 PM »
Thank you also, TJ. Taking a look at WGET and HTTrack..... great packages.
I do appreciate you guys helping me!  :Thmbsup:

crono

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 179
Re: Page data harvest
« Reply #6 on: July 18, 2007, 06:57 PM »
Hi,

HTML is often poorly written. It can be hard to parse if, for example, end tags are missing. I highly recommend "sanitizing" it with HTML Tidy before you start parsing. Set the "output-xml" option to get well-formed XML, which can be parsed with any XML parser library (DOM/SAX) - this is often easier than using regexes.
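
A rough sketch of that pipeline, assuming the tidy command-line tool is installed (the filename is just a placeholder):

Code: (Python)
# Sketch: clean messy HTML with HTML Tidy (output-xml on), then parse
# the result with a standard XML parser.
import subprocess
import xml.etree.ElementTree as ET

result = subprocess.run(
    ["tidy", "--output-xml", "yes", "-q", "214-Attendance.htm"],
    capture_output=True, text=True
)  # no check=True: tidy exits non-zero on mere warnings

# Tidy writes the cleaned document to stdout; since it is now
# well-formed XML, any XML parser can walk it.
root = ET.fromstring(result.stdout)
for elem in root.iter():
    if elem.text and elem.text.strip():
        print(elem.tag, elem.text.strip())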