
Author Topic: Page data harvest  (Read 8754 times)

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Page data harvest
« on: July 16, 2007, 09:52 PM »
Very frequently I would like the ability to harvest data or some information from a page.
Something like Snagit, but simpler, less expensive, and able to capture the raw HTML of a page...

I've attached a document that has pretty pictures and a description.


Simple Solution: Single Page Harvest

A function key responds by saving all the HTML for the current page to a file named

{student number}-{page name}.htm
Example: 214-Attendance.htm

Then later I'd use SED, GREP or AWK to find what I am looking for.
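
For illustration, here is a rough sketch in Python of what I mean - the URL, student number, and page name are just made-up placeholders:

Code: (Python)
# Sketch of the "single page harvest": fetch a page and save its raw
# HTML under the {student number}-{page name}.htm naming scheme.
# The URL and values below are hypothetical placeholders.
import urllib.request

def harvest_page(url, student_number, page_name):
    html = urllib.request.urlopen(url).read()
    filename = "{0}-{1}.htm".format(student_number, page_name)
    with open(filename, "wb") as f:
        f.write(html)
    return filename

# Would produce 214-Attendance.htm
harvest_page("http://example.com/attendance", 214, "Attendance")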

jgpaiva

  • Global Moderator
  • Joined in 2006
  • Posts: 4,727
Re: Page data harvest
« Reply #1 on: July 17, 2007, 05:46 AM »
I'm sorry, I'm not sure I understand what you're looking for, but why not just download the site with something like HTTrack?
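
For reference, HTTrack also has a command-line client that can mirror a site into a local folder - a rough sketch, where the URL and output folder are just placeholders:

Code: (Python)
# Sketch: mirror a site into a local folder with HTTrack's
# command-line client. Assumes httrack is installed and on the PATH;
# the URL and output folder are placeholders.
import subprocess

subprocess.run(["httrack", "http://example.com/", "-O", "./mirror"], check=True)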

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • Posts: 40,900
Re: Page data harvest
« Reply #2 on: July 17, 2007, 10:07 AM »
Browsers already have a function in the File menu for saving a page's HTML.. maybe you are only asking for a simple add-on or hotkey that will trigger that file-save function (and name the file a certain way)?
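
As a very rough sketch of that kind of hotkey - this uses the third-party keyboard and pyperclip Python packages, and everything in it (the hotkey, the student number, the page name) is just an assumed placeholder:

Code: (Python)
# Sketch of a "harvest this page" hotkey. Assumes the third-party
# packages keyboard and pyperclip (pip install keyboard pyperclip).
# Copy the page URL from the address bar, then press F8.
import urllib.request
import keyboard   # global hotkey handling (assumed dependency)
import pyperclip  # clipboard access (assumed dependency)

STUDENT_NUMBER = 214      # placeholder
PAGE_NAME = "Attendance"  # placeholder

def save_current_page():
    url = pyperclip.paste().strip()           # URL copied by the user
    html = urllib.request.urlopen(url).read()
    filename = "%s-%s.htm" % (STUDENT_NUMBER, PAGE_NAME)
    with open(filename, "wb") as f:
        f.write(html)

keyboard.add_hotkey("f8", save_current_page)
keyboard.wait()  # keep the script alive, waiting for the hotkey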

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Re: Page data harvest
« Reply #3 on: July 17, 2007, 02:45 PM »
jgpaiva - the real answer is that I'm ignorant, i.e. I lack experience and knowledge. And not being particularly bright doesn't help. Thank you very much for your suggestion. I'll look at it, and it may be just the right thing. Thanks again! :D


mouser - EXACTLY..... that would give me the minimal functionality I need. Any ideas? Or can you point me to a place where I can either have it done or learn how to do it myself?

tinjaw

  • Supporting Member
  • Joined in 2006
  • Posts: 1,927
Re: Page data harvest
« Reply #4 on: July 17, 2007, 03:03 PM »
If you are advanced enough to write "Then later I'd use SED, GREP or AWK to find what I am looking for," I suggest you look at WGET. It can grab what you need and save it locally.
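
Something along these lines, for example - wget's -q (quiet) and -O (output file) flags are standard, and the URL and search word below are just placeholders:

Code: (Python)
# Sketch: grab one page with wget, then do a grep-style pass over it.
import subprocess

url = "http://example.com/attendance"  # placeholder
outfile = "214-Attendance.htm"

# wget -q (quiet) -O <file> <url> saves the raw HTML locally.
subprocess.run(["wget", "-q", "-O", outfile, url], check=True)

# grep-like step: print every line containing the word "absent"
with open(outfile, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "absent" in line:
            print(line.rstrip())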

jpcook

  • Participant
  • Joined in 2007
  • Posts: 5
Re: Page data harvest
« Reply #5 on: July 18, 2007, 06:19 PM »
Thank you also, TJ. Taking a look at WGET and HTTrack..... great packages.
I do appreciate you guys helping me!  :Thmbsup:

crono

  • Charter Honorary Member
  • Joined in 2005
  • Posts: 179
Re: Page data harvest
« Reply #6 on: July 18, 2007, 06:57 PM »
Hi,

HTML is often poorly written. It can be hard to parse if, for example, end tags are missing. I highly recommend "sanitizing" it with HTML Tidy before you start parsing. Set the "output-xml" option to get well-formed XML, which can be parsed with any XML parser library (DOM/SAX) - this is often easier than using regexes.
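
A rough sketch of that pipeline, assuming the tidy command-line tool is installed (the filename is just a placeholder):

Code: (Python)
# Sketch: clean messy HTML with HTML Tidy (output-xml on), then parse
# the result with a standard XML parser.
import subprocess
import xml.etree.ElementTree as ET

result = subprocess.run(
    ["tidy", "--output-xml", "yes", "-q", "214-Attendance.htm"],
    capture_output=True, text=True
)  # no check=True: tidy exits non-zero on mere warnings

# Tidy writes the cleaned document to stdout; since it is now
# well-formed XML, any XML parser can walk it.
root = ET.fromstring(result.stdout)
for elem in root.iter():
    if elem.text and elem.text.strip():
        print(elem.tag, elem.text.strip())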