Author Topic: how to download the text of webpages? (Read 12258 times)

kalos · « **on:** November 20, 2013, 09:50 AM »

hello

I want to download/save the text of a webpage as it appears when I open it in a web-browser (and automate this, as I want to do it for hundreds of webpages)

unfortunately, downloading the webpage saves its source and not the text I see in the browser

for example:
http://www.exportfoo...es/688-abc-nutrition

this webpage in its text has the email [email protected]
but if you save the webpage and open the downloaded file, this email is NOT present (the page has to run JavaScript to display it, which is meant to keep spambots from harvesting it)

any idea how to overcome this?

thanks

4wd · « **Reply #1 on:** November 20, 2013, 08:13 PM »

Control+A then Control+S, Save as Text ... works for me in Pale Moon (Firefox):

Export Food and Drink <http://www.exportfoodanddrink.org>

  * Home <http://www.exportfoodanddrink.org/>
  * FCCA </index.php/fcca>
  * Latest News </index.php/latest-news>
  * Food & Drink Council </index.php/food-and-drink-council>
  * Channel Clusters </index.php/channel-clusters>
  * Celtic Recipes </index.php/celtic-recipes>
  * Membership </index.php/membership>
  * Events </index.php/ev>
  * About IEA </index.php/about-iea>
  * Contact Us </index.php/contact>
  * SEMINARS & TRAINING </index.php/seminars-a-training>
  * LEGISLATION & STANDARDS </index.php/schemesinitiativesstandards>
  * Access 6 </index.php/access-6>

Joomla Slide Menu by DART Creations <http://www.dart-creations.com>

   MEMBERS LOGIN

Name

Password

Remember Me

  * Forgot your password? </index.php/component/user/reset>
  * Forgot your username? </index.php/component/user/remind>

ABC Nutrition

*ABC Nutrition Ltd* is an Irish owned and operated contract manufacturer
of nutritional powders, vitamin capsules and spray vitamins. Over the
past few years we have developed a reputation for providing top quality
products in the market place. Our scientifically formulated nutritional
powders and 'can do' attitude have made us the preferred manufacturer of
many well-known sports supplement brands.

*Products*

/Complete Whey (6lbs) & /These are popular supplements amongst athletes
trying to gain muscle. Complete Whey is ideal for use before or after
training as part of your workout program. Each serving contains 23.1g of
muscle enhancing protein. Complete Whey contains high purity whey
protein concentrate isolate and hydrolysate. It contains no added sugar
or Maltodextrin.

Flavours available*:* Strawberry, Chocolate, Vanilla, Chocolate Mint,
Banana, Chocolate & Caramel Cookie, Strawberry & Banana

/Complete Whey 2lbs (908g)/

*Contact:*

ABC Nutrition Ltd.
Unit 7a Knockbeg Point,
Shannon Airport,
Co. Clare,
Ireland.

Tel: +353 (0)61 433010

Fax: +353 (0)61 433012

Mob: +353 (0)86 1775766

Email: [email protected]
<mailto:[email protected]>This e-mail address is being
protected from spambots. You need JavaScript enabled to view it

28 Merrion Square, Dublin 2 | T: +353 1 661 2182 | F: +353 1 661 2315
Website Disclaimer and Privacy Policy
<http://www.exportfoodanddrink.org/index.php/iea-irish-exporters-association-export-food-and-drink-iea-export-food-and-drink/disclaimer>
*Export Food & Drink* | 2010 © All right reserved
E: [email protected]
<mailto:[email protected]> | Site by CCS
<http://www.ccs.ie>

  * Progress </index.php/fcca/progress>
  * Environmental Policy </index.php/fcca/environmental-policy>
  * Expression of Interest in FCCA
   </index.php/fcca/expression-of-interest-in-fcca>
  * Workshops </index.php/fcca/workshops>

  * News & Press Articles </index.php/latest-news/latest-news>

  * Upcoming meetings </index.php/food-and-drink-council/upcoming-meetings>
  * Past meetings </index.php/food-and-drink-council/past-meetings>

  * Participants </index.php/channel-clusters/participants>
  * Progress to date </index.php/channel-clusters/progress-to-date>

  * Progress To Date </index.php/celtic-recipes/progress-to-date>
  * Participants </index.php/celtic-recipes/participants>

  * Become a member </index.php/membership/become-a-member>
  * Member Profiles </index.php/membership/member-profiles>

  * Upcoming Events </index.php/ev/upcoming-events>
  * Past Events </index.php/ev/past-events>

  * About IEA </index.php/about-iea/about-iea>
  * Disclaimer </index.php/about-iea/disclaimer>

  * New Legislation
   </index.php/schemesinitiativesstandards/new-legislation>
  * Existing Schemes/Standards
   </index.php/schemesinitiativesstandards/existing-schemesstandards>

  * Progress </index.php/access-6/progress>
  * Expression Of Interest </index.php/access-6/expression-of-interest>

Email address is there.

There's always my GreaseMonkey userscript for some outright munging of the website before you select/copy.

Other than that, any macro recorder should be able to handle it.

kalos · « **Reply #2 on:** November 21, 2013, 06:04 AM »

I need to do this in an automated way, to load the list of the urls and to produce the same number of their text files

4wd · « **Reply #3 on:** November 21, 2013, 04:41 PM »

You'll probably need to look for a browser addon since it'll have to load, interpret (including JavaScript), and then save, either as HTML or Text.
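For anyone who would rather script that load/interpret/save loop than hunt for an addon, here is a minimal sketch. It assumes the third-party Selenium package and a Firefox driver are installed, neither of which was mentioned in this thread. The function names, the `rendered` output folder and the `pageNNNN.txt` naming are made up for illustration.

```python
from pathlib import Path


def read_urls(list_text):
    """Return the non-blank lines of a URL list, stripped of whitespace."""
    return [line.strip() for line in list_text.splitlines() if line.strip()]


def save_rendered_text(urls, out_dir="rendered"):
    """Open each URL in headless Firefox and save the visible page text."""
    # Assumption: Selenium is installed (pip install selenium). It is imported
    # here so that read_urls() stays usable without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    Path(out_dir).mkdir(exist_ok=True)
    try:
        for n, url in enumerate(urls, start=1):
            driver.get(url)  # blocks until the browser reports the page loaded
            # .text is the text as rendered in the browser, after scripts ran
            text = driver.find_element(By.TAG_NAME, "body").text
            Path(out_dir, f"page{n:04d}.txt").write_text(text, encoding="utf-8")
    finally:
        driver.quit()  # always close the browser, even after an error
```

Something like `save_rendered_text(read_urls(Path("URLlist.txt").read_text()))` would then produce one text file per URL. Pages that fill in content some time after loading may need an explicit wait before the text is read.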

I don't know of any standalone web ripper that can interpret JavaScript (to grab protected emails).

If the JavaScript requirement is not critical, then wget will download a list of URLs, and HTMLAsText will convert them all.

eg.
wget.exe -i URLlist.txt -E
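The same no-JavaScript route can be roughed out in Python with only the standard library, as a stand-in for the wget plus HTMLAsText pair rather than a description of how either tool works. The `text` output folder, the `pageNNNN.txt` naming and the UTF-8 decoding are assumptions for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def save_all(list_file="URLlist.txt", out_dir="text"):
    """Download every URL in list_file and write one .txt file per page."""
    Path(out_dir).mkdir(exist_ok=True)
    lines = Path(list_file).read_text().splitlines()
    urls = [u.strip() for u in lines if u.strip()]
    for n, url in enumerate(urls, start=1):
        with urlopen(url) as resp:
            # Assumption: pages are UTF-8; undecodable bytes are replaced
            html = resp.read().decode("utf-8", errors="replace")
        out_file = Path(out_dir, f"page{n:04d}.txt")
        out_file.write_text(html_to_text(html), encoding="utf-8")
```

Nothing here runs the page's scripts, so anything the page writes in with JavaScript, such as the protected email address, will still be missing from the output.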

wraith808 · « **Reply #4 on:** November 21, 2013, 04:51 PM »

There's software for this type of thing (I've evaluated and made enough of them for work :-\ )

What you're looking for is a website scraper. I will say that in my experience, the quality of the scraping is going to be directly proportional to how much you pay... there's not much that's FOSS or free that's worth anything.

Are you looking to pay for this? Or are you looking for something free?

kalos · « **Reply #5 on:** November 22, 2013, 04:31 AM »

Are you looking to pay for this? Or are you looking for something free?
-wraith808 (November 21, 2013, 04:51 PM)

something free would be ideal, but if I have to pay a reasonable amount, it would be okay

patteo · « **Reply #6 on:** November 25, 2013, 02:00 AM »

Are you looking to pay for this? Or are you looking for something free?
-wraith808 (November 21, 2013, 04:51 PM)

something free would be ideal, but if I have to pay a reasonable amount, it would be okay
-kalos (November 22, 2013, 04:31 AM)

Browser Scripting, Data Extraction and Web Testing by iMacros
Use iMacros® 9 to create solutions for web automation, web scraping or web testing in just five minutes.
http://www.iopus.com/imacros/

There's a free Firefox Add-in
https://addons.mozil...imacros-for-firefox/

and Chrome Add-in
https://chrome.googl...fgfijoopmnlemp?hl=en

wraith808 · « **Reply #7 on:** November 25, 2013, 12:49 PM »

Mozenda was favorably reviewed in our trials, and has a free account to try it out with (and that might be good for low throughput). We still actually use it for some stuff.

iRobotSoft is free... it was nowhere near what we needed for our uses, but it might work for you.

Automation Anywhere - again, didn't serve our purposes, but it is feature rich. We still actually use it for some low intensity things.

IainB · « **Reply #8 on:** November 25, 2013, 03:06 PM »

Consider Scrapbook: I'm not sure if it is exactly what you want, but the Firefox Scrapbook extension is quite good at scraping individual web pages and those web-pages nested underneath a page. You can tell it how "deep" to go in the nested site, and what files to pick up/ignore as it goes. It tidies all the links up when done, so you have a relatively self-contained copy of the downloaded pages.
I have highlighted in the screenshot below - using red boxes/arrow - the relevant bits of the download option (which pops up on a drag/drop save to a Scrapbook folder):

[Screenshot: Scrapbook - 01 Download options.png]

Scrapbook is useful where you want to be able to search/retrieve the content easily. Not only can you use Scrapbook to index/search all that it downloads, but also, since it saves the content in a non-proprietary format as html, you can index/search it with the standard Windows Desktop Search and - in my case - xplorer² (a Windows Explorer replacement tool). Any files it creates or downloads also immediately show up in the file/folder search tool Everything.

IainB · « **Reply #9 on:** November 25, 2013, 06:06 PM »

You could use the Mozilla Archive Format extension to save individual pages, or several/all tabs, in either MAFF or MHTML format.
Depending on the constraints, using search/index tools on these files may be problematic as:

The MAFF format files are based on the .ZIP format.
The MHTML format files are based on the MIME format.

When these files are opened, the result is not always as "exact" a copy as (say) the Scrapbooked pages, and it may also differ from browser to browser.
By the way, note that the Scrapbook extension currently only works in Firefox.

tomos · « **Reply #10 on:** November 26, 2013, 05:20 AM »

another possibility here on dc:
Tip: Fast saving of html pages in Firefox by renaming them with a short name

wraith808 · « **Reply #11 on:** November 26, 2013, 11:57 AM »

A lot of this depends on use. If you're downloading the pages as-is for browsing, then these last methods will work. Since the OP said to download the text of the web pages in bulk, I'd assume that it wasn't for browsing, but rather for the data. Perhaps that clarification will help to focus the conversation on the appropriate software.

kalos · « **Reply #12 on:** November 26, 2013, 01:07 PM »

A lot of this depends on use. If you're downloading the pages as-is for browsing, then these last methods will work. Since the OP said to download the text of the web pages in bulk, I'd assume that it wasn't for browsing, but rather for the data. Perhaps that clarification will help to focus the conversation on the appropriate software.
-wraith808 (November 26, 2013, 11:57 AM)

yes, it was for data
