Author Topic: Is there software for this?  (Read 6266 times)

SomebodySmart

  • Participant
  • Joined in 2015
  • Posts: 6
Is there software for this?
« on: June 13, 2015, 01:37 PM »
I go to http://www.pedersonf...home.com/obituaries/ and there's a list of twelve obituaries.

Each has a URL that is in the HTML code and is easy to capture, and the target file is easy to curl or wget, but I want ALL the hundreds of URLs to individual obituaries and I don't want to do the work by hand. The NEXT key actually produces a list of the next twelve, but the VIEW SOURCE function still lists only the first twelve in the source code. Now, is there a product that will download and capture everything one at a time so I can leave the machine on auto-pilot?


ayryq

  • Supporting Member
  • Joined in 2009
  • Points: 101
  • Posts: 289
Re: Is there software for this?
« Reply #1 on: June 13, 2015, 05:21 PM »
In a couple of minutes of looking, I couldn't find a way. But I'm posting because I used to live a couple of blocks from Pederson, in Rockford, MI. That's all :)
« Last Edit: June 13, 2015, 05:50 PM by ayryq »

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • Posts: 40,896
Re: Is there software for this?
« Reply #2 on: June 13, 2015, 05:52 PM »
Well, there are a few programs designed to "spider" a page and download all linked pages, images, etc.

One well-known one is "Teleport Pro", but there are others.

SomebodySmart

  • Participant
  • Joined in 2015
  • Posts: 6
Re: Is there software for this?
« Reply #3 on: June 13, 2015, 07:00 PM »
Quote from: mouser on June 13, 2015, 05:52 PM
  Well, there are a few programs designed to "spider" a page and download all linked pages, images, etc. One well-known one is "Teleport Pro", but there are others.

I looked at Teleport Pro, but it doesn't look like it will be able to scan and download the output of scripts, just static pages.

ayryq

  • Supporting Member
  • Joined in 2009
  • Points: 101
  • Posts: 289
Re: Is there software for this?
« Reply #4 on: June 13, 2015, 07:18 PM »
I figured it out:

Go to http://www.pedersonf...ies/ObitSearchList/1 (and increment the final number)

Eric

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Is there software for this?
« Reply #5 on: June 13, 2015, 09:12 PM »
Quote from: ayryq on June 13, 2015, 07:18 PM
  Go to http://www.pedersonf...ies/ObitSearchList/1 (and increment the final number)

The last page number is also stored within the source:

Code: HTML
<input type="hidden" id="totPages" value="83" />
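
So a script can read the page count straight from that element, e.g. something like this (assuming the id stays "totPages"):

Code: Javascript
var maxPage = parseInt(document.getElementById("totPages").value, 10);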

This could probably be done with a GreaseMonkey script that cycles through each page, grabbing the links, and at the end displays a page with all of them, which could then be saved using Save Page As ...

Just messing around, this is a heavily modified site scraper from http://blog.nparashu...ascript-firebug.html

Currently it will start at the URL @ayryq mentioned above and load every page until the last one (requires GreaseMonkey, naturally) at a rate of about one every 3 seconds. It also grabs all the URLs from each page, but as I haven't worked out how to store them yet, they get overwritten at each page load.
Code: Javascript
// ==UserScript==
// @name Get The Deadites
// @namespace http://blog.nparashuram.com/2009/08/screen-scraping-with-javascript-firebug.html
// @include http://www.pedersonfuneralhome.com/obituaries/ObitSearchList/*
// ==/UserScript==

/*
* Much modified from the original script for a specific site
*/

function loadNextPage(){
  var url = "http://www.pedersonfuneralhome.com/obituaries/ObitSearchList/";
  var num = parseInt(document.location.href.substring(document.location.href.lastIndexOf("/") + 1));
  if (isNaN(num)) {
    num = 1;
  }
// If the counter exceeds the max number of pages we need to stop loading pages, otherwise we go Energizer Bunny.
  if (num < maxPage) {
    document.location = url + (num + 1);
//  } else {
// Reached last page, need to read localStorage using JSON.parse
// Create document with URLs retrieved from localStorage and open in browser, user can then use Save Page As ...
  }
}

function start(newlyDeads){
// Need to get previous entries from localStorage (if they exist)
//  var oldDeads = localStorage.getItem('obits');
//  if (oldDeads === null) {   // No previous data so just store the new stuff
//    localStorage.setItem('obits', JSON.stringify(newlyDeads));
//  } else {
// Convert to object using JSON.parse
//    var tmpDeads = JSON.parse(oldDeads);
// Merge oldDeads and newlyDeads - new merged object stored in first object argument passed
//    m(tmpDeads, newlyDeads);
// Save back to localStorage using JSON.stringify
//    localStorage.setItem('obits', JSON.stringify(tmpDeads));
//  }

/*
* Don't run a loop, better to run a timeout sort of function.
* Will not put load on the server
*/
  var timerHandler = window.setInterval(function(){
    window.clearInterval(timerHandler);
    window.setTimeout(loadNextPage, 2000);
  }, 1000); // this is the time taken for your next page to load
}

// https://gist.github.com/3rd-Eden/988478
// function m(a,b,c){for(c in b)b.hasOwnProperty(c)&&((typeof a[c])[0]=='o'?m(a[c],b[c]):a[c]=b[c])}

var maxPage;
var records = document.getElementsByTagName("A");      // Grab all anchors within the page
//delete records[12];                                  // Need to delete "Next" anchor from object (property 13)
var inputs = document.getElementsByTagName("INPUT");   // Grab all the INPUT elements
maxPage = parseInt(inputs[2].value, 10);               // Maximum pages is the value of the third INPUT tag
start(records);

The comments within the code are what I think should happen, but I haven't tested it yet (mainly because I can't code in Javascript ... but I'm perfectly capable of hitting it with a sledgehammer until it does what I want ... or I give up  :P ).

Someone who actually does know Javascript could probably fill in the big blank areas in record time.
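
For what it's worth, the commented-out bits might end up looking something like this — just a rough, untested sketch that stores a plain array of hrefs in localStorage rather than whole element collections; the 'obits' key and the /obituaries/ link filter are guesses:

Code: Javascript
// Sketch of the missing pieces: saveLinks() would replace the commented
// localStorage block in start(), and showAllLinks() would be called from the
// else branch of loadNextPage() once the last page is reached.

function saveLinks(anchors) {
  // Pull whatever was stored on earlier pages, or start with an empty list
  var stored = JSON.parse(localStorage.getItem('obits') || '[]');
  for (var i = 0; i < anchors.length; i++) {
    var href = anchors[i].href;
    // Keep only links that look like individual obituaries (guessed filter)
    if (href.indexOf('/obituaries/') !== -1 &&
        href.indexOf('ObitSearchList') === -1 &&
        stored.indexOf(href) === -1) {
      stored.push(href);
    }
  }
  localStorage.setItem('obits', JSON.stringify(stored));
}

function showAllLinks() {
  // On the last page, replace the document with a plain list of links
  // so the user can do Save Page As ...
  var stored = JSON.parse(localStorage.getItem('obits') || '[]');
  document.body.innerHTML = stored.map(function (href) {
    return '<a href="' + href + '">' + href + '</a>';
  }).join('<br />\n');
}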
« Last Edit: June 14, 2015, 08:44 AM by 4wd »

SomebodySmart

  • Participant
  • Joined in 2015
  • Posts: 6
Re: Is there software for this?
« Reply #6 on: June 14, 2015, 08:49 PM »
Excellent! It looks like I'll now be able to build hyperlinks to every obituary on every website built by FuneralOne.com! Thanks.

As for GreaseMonkey, I don't know anything about that, but I do use curl and my home-made Python 3.2 programs.



ayryq

  • Supporting Member
  • Joined in 2009
  • Points: 101
  • Posts: 289
Re: Is there software for this?
« Reply #7 on: June 15, 2015, 07:26 AM »
So... what are you doing, anyway?


SomebodySmart

  • Participant
  • Joined in 2015
  • Posts: 6
Re: Is there software for this?
« Reply #8 on: June 15, 2015, 06:17 PM »
Quote from: ayryq on June 15, 2015, 07:26 AM
  So... what are you doing, anyway?

I'm building a genealogy website that will help people trace their family trees. There's a lot of genealogy info in obituaries. I can't copy the obituaries onto my new website for obvious copyright reasons, but I can help people find those obituaries on the funeral home and newspaper websites for free.