Author Topic: how to download the text of webpages?  (Read 12688 times)

kalos

how to download the text of webpages?
« on: November 20, 2013, 09:50 AM »
hello

I want to download/save the text of a webpage as it appears when I open it in a web browser (and automate this, as I want to do it for hundreds of webpages)

unfortunately, trying to download the webpage results in saving the source of the webpage and not its text

for example:
http://www.exportfoo...es/688-abc-nutrition

the text of this webpage includes the email [email protected]
but if you save the webpage and open the downloaded file, this email is NOT present (the page has to run JavaScript to display it, which is there to block spambots)

any idea how to overcome this?

thanks

4wd

Re: how to download the text of webpages?
« Reply #1 on: November 20, 2013, 08:13 PM »
Control+A then Control+S, Save as Text ... works for me in Pale Moon (Firefox):

Spoiler
Export Food and Drink <http://www.exportfoodanddrink.org>

  * Home <http://www.exportfoodanddrink.org/>
  * FCCA </index.php/fcca>
  * Latest News </index.php/latest-news>
  * Food & Drink Council </index.php/food-and-drink-council>
  * Channel Clusters </index.php/channel-clusters>
  * Celtic Recipes </index.php/celtic-recipes>
  * Membership </index.php/membership>
  * Events </index.php/ev>
  * About IEA </index.php/about-iea>
  * Contact Us </index.php/contact>
  * SEMINARS & TRAINING </index.php/seminars-a-training>
  * LEGISLATION & STANDARDS </index.php/schemesinitiativesstandards>
  * Access 6 </index.php/access-6>

Joomla Slide Menu by DART Creations <http://www.dart-creations.com>


      MEMBERS LOGIN

Name

Password

Remember Me

  * Forgot your password? </index.php/component/user/reset>
  * Forgot your username? </index.php/component/user/remind>

ABC Nutrition

**

 

*ABC Nutrition Ltd* is an Irish owned and operated contract manufacturer
of nutritional powders, vitamin capsules and spray vitamins. Over the
past few years we have developed a reputation for providing top quality
products in the market place. Our scientifically formulated nutritional
powders and 'can do' attitude have made us the preferred manufacturer of
many well-known sports supplement brands.

*Products*

/Complete Whey (6lbs) & /These are popular supplements amongst athletes
trying to gain muscle. Complete Whey is ideal for use before or after
training as part of your workout program. Each serving contains 23.1g of
muscle enhancing protein. Complete Whey contains high purity whey
protein concentrate isolate and hydrolysate. It contains no added sugar
or Maltodextrin.

Flavours available*:* Strawberry, Chocolate, Vanilla, Chocolate Mint,
Banana, Chocolate & Caramel Cookie, Strawberry & Banana

/Complete Whey 2lbs (908g)/

*Contact:*

ABC Nutrition Ltd.
Unit 7a Knockbeg Point,
Shannon Airport,
Co. Clare,
Ireland.

Tel: +353 (0)61 433010

Fax: +353 (0)61 433012

Mob: +353 (0)86 1775766

Email: [email protected]
<mailto:[email protected]>This e-mail address is being
protected from spambots. You need JavaScript enabled to view it

 
28 Merrion Square, Dublin 2 | T: +353 1 661 2182 | F: +353 1 661 2315
Website Disclaimer and Privacy Policy
<http://www.exportfoodanddrink.org/index.php/iea-irish-exporters-association-export-food-and-drink-iea-export-food-and-drink/disclaimer>
*Export Food & Drink* | 2010 © All right reserved
E: [email protected]
<mailto:[email protected]> | Site by CCS
<http://www.ccs.ie>

  * Progress </index.php/fcca/progress>
  * Environmental Policy </index.php/fcca/environmental-policy>
  * Expression of Interest in FCCA
    </index.php/fcca/expression-of-interest-in-fcca>
  * Workshops </index.php/fcca/workshops>

  * News & Press Articles </index.php/latest-news/latest-news>

  * Upcoming meetings </index.php/food-and-drink-council/upcoming-meetings>
  * Past meetings </index.php/food-and-drink-council/past-meetings>

  * Participants </index.php/channel-clusters/participants>
  * Progress to date </index.php/channel-clusters/progress-to-date>

  * Progress To Date </index.php/celtic-recipes/progress-to-date>
  * Participants </index.php/celtic-recipes/participants>

  * Become a member </index.php/membership/become-a-member>
  * Member Profiles </index.php/membership/member-profiles>

  * Upcoming Events </index.php/ev/upcoming-events>
  * Past Events </index.php/ev/past-events>

  * About IEA </index.php/about-iea/about-iea>
  * Disclaimer </index.php/about-iea/disclaimer>

  * New Legislation
    </index.php/schemesinitiativesstandards/new-legislation>
  * Existing Schemes/Standards
    </index.php/schemesinitiativesstandards/existing-schemesstandards>

  * Progress </index.php/access-6/progress>
  * Expression Of Interest </index.php/access-6/expression-of-interest>


Email address is there.

There's always my GreaseMonkey userscript for some outright munging of the website before you select/copy.

Other than that, any macro recorder should be able to handle it.
« Last Edit: November 20, 2013, 08:26 PM by 4wd »

kalos

Re: how to download the text of webpages?
« Reply #2 on: November 21, 2013, 06:04 AM »
I need to do this in an automated way: load a list of URLs and produce one text file per URL

4wd

Re: how to download the text of webpages?
« Reply #3 on: November 21, 2013, 04:41 PM »
You'll probably need to look for a browser addon since it'll have to load, interpret (including JavaScript), and then save, either as HTML or Text.
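A rough sketch of that load/interpret/save loop, driving a real browser from a script. It assumes the third-party Selenium package and a Firefox driver are installed; the function names, the URLlist.txt input file and the text_out folder are made up for illustration. It reads the text straight after the page load, so anything a script fills in later would need an explicit wait.

```python
from pathlib import Path


def visible_text(driver, url):
    """Load url in the browser and return the rendered text of <body>."""
    driver.get(url)
    # "tag name" is the value of Selenium's By.TAG_NAME locator constant.
    return driver.find_element("tag name", "body").text


def save_all(driver, urls, out_dir):
    """Write one numbered .txt file per URL and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, url in enumerate(urls, start=1):
        path = out / f"page_{i:04d}.txt"
        path.write_text(visible_text(driver, url), encoding="utf-8")
        paths.append(path)
    return paths


def main(url_list="URLlist.txt", out_dir="text_out"):
    """Hypothetical entry point: one URL per line in url_list."""
    from selenium import webdriver  # third-party, imported only when needed

    lines = Path(url_list).read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    driver = webdriver.Firefox()
    try:
        return save_all(driver, urls, out_dir)
    finally:
        driver.quit()
```

Calling main() would open Firefox, visit each URL in turn and leave page_0001.txt, page_0002.txt and so on in text_out.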

I don't know of any standalone web ripper that can interpret JavaScript (to grab protected emails).

If the JavaScript requirement is not critical, then wget will download a list of URLs, and HTMLAsText will convert them all.

e.g.
wget.exe -i URLlist.txt -E
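If you'd rather script the conversion step, here is a minimal sketch using only the Python standard library. It is not HTMLAsText, just a simple stand-in: it drops tags plus script/style content and keeps the remaining text, one chunk per line. The names html_to_text and convert_folder are made up, and it assumes the .html files that wget saved are sitting in one folder. Like wget, it does not run JavaScript.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def convert_folder(folder):
    """Write a .txt file next to every .html file in folder."""
    written = []
    for src in sorted(Path(folder).glob("*.html")):
        html = src.read_text(encoding="utf-8", errors="replace")
        dst = src.with_suffix(".txt")
        dst.write_text(html_to_text(html), encoding="utf-8")
        written.append(dst)
    return written
```

Run the wget line first, then call convert_folder(".") in the same directory.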

wraith808

Re: how to download the text of webpages?
« Reply #4 on: November 21, 2013, 04:51 PM »
There's software for this type of thing (I've evaluated and made enough of them for work  :-\)

What you're looking for is a website scraper.  I will say that in my experience, the quality of the scraping is going to be directly proportional to how much you pay... not much that's FOSS or free is worth anything.

Are you looking to pay for this?  Or are you looking for something free?

kalos

Re: how to download the text of webpages?
« Reply #5 on: November 22, 2013, 04:31 AM »
Quote from: wraith808
Are you looking to pay for this?  Or are you looking for something free?

something free would be ideal, but if I have to pay a reasonable amount, it would be okay

patteo

Re: how to download the text of webpages?
« Reply #6 on: November 25, 2013, 02:00 AM »
Quote from: kalos
Quote from: wraith808
Are you looking to pay for this?  Or are you looking for something free?

something free would be ideal, but if I have to pay a reasonable amount, it would be okay

Browser Scripting, Data Extraction and Web Testing by iMacros
Use iMacros® 9 to create solutions for web automation, web scraping or web testing in just five minutes.
http://www.iopus.com/imacros/

There's a free Firefox Add-in
https://addons.mozil...imacros-for-firefox/

and Chrome Add-in
https://chrome.googl...fgfijoopmnlemp?hl=en

wraith808

Re: how to download the text of webpages?
« Reply #7 on: November 25, 2013, 12:49 PM »
Mozenda was favorably reviewed in our trials, and has a free account to try it out with (and that might be good for low throughput).  We still actually use it for some stuff.

iRobotSoft is free... it was nowhere near what we needed for our uses, but it might work for you.

Automation Anywhere - again, didn't serve our purposes, but it is feature rich.  We still actually use it for some low intensity things.

IainB

Re: how to download the text of webpages?
« Reply #8 on: November 25, 2013, 03:06 PM »
Consider Scrapbook: I'm not sure if it is exactly what you want, but the Firefox Scrapbook extension is quite good at scraping individual web pages and those web-pages nested underneath a page. You can tell it how "deep" to go in the nested site, and what files to pick up/ignore as it goes. It tidies all the links up when done, so you have a relatively self-contained copy of the downloaded pages.
I have highlighted in the screenshot below - using red boxes/arrow - the relevant bits of the download option (which pops up on a drag/drop save to a Scrapbook folder):

Scrapbook - 01 Download options.png

Scrapbook is useful where you want to be able to search/retrieve the content easily. Not only can you use Scrapbook to index/search all that it downloads, but also, since it saves the content in a non-proprietary format as html, you can index/search it with the standard Windows Desktop Search and - in my case - xplorer² (a Windows Explorer replacement tool). Any files it creates or downloads also immediately show up in the file/folder search tool Everything.
« Last Edit: November 25, 2013, 03:13 PM by IainB, Reason: Added note about Everything. »

IainB

Re: how to download the text of webpages?
« Reply #9 on: November 25, 2013, 06:06 PM »
You could use the Mozilla Archive Format extension to save individual pages, or several/all tabs, in either MAFF or MHTML format.
Depending on the constraints, using search/index tools on these files may be problematic as:
  • The MAFF format files are based on the .ZIP format.
  • The MHTML format files are based on the MIME format.
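Since both are standard containers, the HTML can be pulled back out with a few lines of script if your search/index tool can't read them directly. A minimal sketch using only the Python standard library; the function names are made up, and it assumes the saved pages sit inside the archive as .html/.htm members (MAFF) or as text/html parts (MHTML).

```python
import email
import io
import zipfile


def html_from_mhtml(data: bytes) -> list[str]:
    """Return every text/html part found in an MHTML (MIME) file."""
    msg = email.message_from_bytes(data)
    pages = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
            charset = part.get_content_charset() or "utf-8"
            pages.append(raw.decode(charset, errors="replace"))
    return pages


def html_from_maff(data: bytes) -> dict[str, str]:
    """Return {member name: HTML} for each .html/.htm file in a MAFF (ZIP) archive."""
    pages = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".htm")):
                # Assumes UTF-8; undecodable bytes are replaced rather than raising.
                pages[name] = archive.read(name).decode("utf-8", errors="replace")
    return pages
```

Read the saved file with open(path, "rb").read() and pass the bytes in; the extracted HTML can then be indexed or converted to text like any other page.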

MAFF - 01 MAF Options.png

When opened, the output of these files may not be as "exact" a copy as (say) the Scrapbooked pages, and may also differ from browser to browser.
By the way, note that the Scrapbook extension currently only works in Firefox.

tomos

Re: how to download the text of webpages?
« Reply #10 on: November 26, 2013, 05:20 AM »
Tom

wraith808

Re: how to download the text of webpages?
« Reply #11 on: November 26, 2013, 11:57 AM »
A lot of this depends on use.  If you're downloading the pages as-is for browsing, then these last methods will work.  Since the OP said d/l the text of the web pages in bulk, I'd assume that it wasn't for browsing, but rather for the data.  Perhaps that clarification will help to focus the conversation on the appropriate software.

kalos

Re: how to download the text of webpages?
« Reply #12 on: November 26, 2013, 01:07 PM »
Quote from: wraith808
A lot of this depends on use.  If you're downloading the pages as is for that purpose- then these last methods will work.  Since the OP said d/l the text of the web pages in bulk, I'd assume that it wasn't for browsing, but rather for the data.  Perhaps that clarification will help to focus the conversation on the appropriate software.

yes, it was for data