how to download the text of webpages?

kalos:
hello

I want to download/save the text of a webpage as it appears when I open it in a web browser, and to automate this, since I need to do it for hundreds of webpages.

Unfortunately, downloading the webpage saves its source, not its rendered text.

for example:
http://www.exportfoodanddrink.org/index.php/membership/member-profiles/688-abc-nutrition

The text of this webpage includes the email info@abcsportsnutrition.com,
but if you save the webpage and open the downloaded file, this email is NOT present (the page has to run JavaScript to display it, as a protection against spambots).

any idea how to overcome this?

thanks

4wd:
Control+A then Control+S, Save as Text ... works for me in Pale Moon (Firefox):

Spoiler:
Export Food and Drink <http://www.exportfoodanddrink.org>

  * Home <http://www.exportfoodanddrink.org/>
  * FCCA </index.php/fcca>
  * Latest News </index.php/latest-news>
  * Food & Drink Council </index.php/food-and-drink-council>
  * Channel Clusters </index.php/channel-clusters>
  * Celtic Recipes </index.php/celtic-recipes>
  * Membership </index.php/membership>
  * Events </index.php/ev>
  * About IEA </index.php/about-iea>
  * Contact Us </index.php/contact>
  * SEMINARS & TRAINING </index.php/seminars-a-training>
  * LEGISLATION & STANDARDS </index.php/schemesinitiativesstandards>
  * Access 6 </index.php/access-6>

Joomla Slide Menu by DART Creations <http://www.dart-creations.com>


      MEMBERS LOGIN

Name

Password

Remember Me

  * Forgot your password? </index.php/component/user/reset>
  * Forgot your username? </index.php/component/user/remind>

ABC Nutrition

**

 

*ABC Nutrition Ltd* is an Irish owned and operated contract manufacturer
of nutritional powders, vitamin capsules and spray vitamins. Over the
past few years we have developed a reputation for providing top quality
products in the market place. Our scientifically formulated nutritional
powders and 'can do' attitude have made us the preferred manufacturer of
many well-known sports supplement brands.

*Products*

/Complete Whey (6lbs) & /These are popular supplements amongst athletes
trying to gain muscle. Complete Whey is ideal for use before or after
training as part of your workout program. Each serving contains 23.1g of
muscle enhancing protein. Complete Whey contains high purity whey
protein concentrate isolate and hydrolysate. It contains no added sugar
or Maltodextrin.

Flavours available*:* Strawberry, Chocolate, Vanilla, Chocolate Mint,
Banana, Chocolate & Caramel Cookie, Strawberry & Banana

/Complete Whey 2lbs (908g)/

*Contact:*

ABC Nutrition Ltd.
Unit 7a Knockbeg Point,
Shannon Airport,
Co. Clare,
Ireland.

Tel: +353 (0)61 433010

Fax: +353 (0)61 433012

Mob: +353 (0)86 1775766

Email: info@abcsportsnutrition.com
<mailto:info@abcsportsnutrition.com>This e-mail address is being
protected from spambots. You need JavaScript enabled to view it

 
28 Merrion Square, Dublin 2 | T: +353 1 661 2182 | F: +353 1 661 2315
Website Disclaimer and Privacy Policy
<http://www.exportfoodanddrink.org/index.php/iea-irish-exporters-association-export-food-and-drink-iea-export-food-and-drink/disclaimer>
*Export Food & Drink* | 2010 © All right reserved
E: exportfoodanddrink@irishexporters.ie
<mailto:exportfoodanddrink@irishexporters.ie> | Site by CCS
<http://www.ccs.ie>

  * Progress </index.php/fcca/progress>
  * Environmental Policy </index.php/fcca/environmental-policy>
  * Expression of Interest in FCCA
    </index.php/fcca/expression-of-interest-in-fcca>
  * Workshops </index.php/fcca/workshops>

  * News & Press Articles </index.php/latest-news/latest-news>

  * Upcoming meetings </index.php/food-and-drink-council/upcoming-meetings>
  * Past meetings </index.php/food-and-drink-council/past-meetings>

  * Participants </index.php/channel-clusters/participants>
  * Progress to date </index.php/channel-clusters/progress-to-date>

  * Progress To Date </index.php/celtic-recipes/progress-to-date>
  * Participants </index.php/celtic-recipes/participants>

  * Become a member </index.php/membership/become-a-member>
  * Member Profiles </index.php/membership/member-profiles>

  * Upcoming Events </index.php/ev/upcoming-events>
  * Past Events </index.php/ev/past-events>

  * About IEA </index.php/about-iea/about-iea>
  * Disclaimer </index.php/about-iea/disclaimer>

  * New Legislation
    </index.php/schemesinitiativesstandards/new-legislation>
  * Existing Schemes/Standards
    </index.php/schemesinitiativesstandards/existing-schemesstandards>

  * Progress </index.php/access-6/progress>
  * Expression Of Interest </index.php/access-6/expression-of-interest>

Email address is there.

There's always my GreaseMonkey userscript for some outright munging of the website before you select/copy.

Other than that, any macro recorder should be able to handle it.

kalos:
I need to do this in an automated way: load a list of URLs and produce one text file per URL.

4wd:
You'll probably need to look for a browser addon since it'll have to load, interpret (including JavaScript), and then save, either as HTML or Text.
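
A minimal sketch of that load, interpret, save loop, using a scriptable headless browser instead of an addon. Playwright is a swap-in here, not something suggested in this thread, and it is a third-party package that has to be installed separately. The names `fetch_rendered_text`, `file_name_for`, and `save_all` are hypothetical, chosen only for illustration.

```python
import re
from pathlib import Path
from typing import Callable, Iterable


def fetch_rendered_text(url: str) -> str:
    """Load url in headless Chromium and return the body text after scripts run.

    Assumption: Playwright and its Chromium build are installed
    (pip install playwright, then: playwright install chromium).
    """
    from playwright.sync_api import sync_playwright  # lazy: only needed here

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # give page scripts time to finish
        text = page.inner_text("body")            # rendered text, not the source
        browser.close()
    return text


def file_name_for(index: int, url: str) -> str:
    """Build a filesystem-safe name; the index keeps names unique per URL."""
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", url.split("://", 1)[-1]).strip("_")
    return f"{index:04d}_{slug[:100]}.txt"


def save_all(urls: Iterable[str], out_dir: str,
             fetch: Callable[[str], str] = fetch_rendered_text) -> list:
    """Write one text file per URL and return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls, start=1):
        path = out / file_name_for(i, url)
        path.write_text(fetch(url), encoding="utf-8")
        written.append(path)
    return written
```

With a URL list in a file such as URLlist.txt (one URL per line), something like `save_all(Path("URLlist.txt").read_text().split(), "out")` should produce the same number of text files as there are URLs. The `fetch` parameter is there so a different way of loading pages can be swapped in.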

I don't know of any standalone web ripper that can interpret JavaScript (to grab protected emails).

If the JavaScript requirement is not critical, then wget will download a list of URLs, and HTMLAsText will convert them all.

eg.
wget.exe -i URLlist.txt -E
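
For the conversion step, a rough stand-in for what an HTML-to-text converter does can be sketched with Python's standard library alone. This is not how HTMLAsText works internally, and `TextExtractor` and `html_to_text` are hypothetical names. Like the wget route, it never runs JavaScript, so script-protected emails will not show up in the output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Each file that wget saved could be read and passed through `html_to_text`, with the result written next to it as a .txt file.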

wraith808:
There's software for this type of thing (I've evaluated and made enough of them for work  :-\)

What you're looking for is a website scraper. I will say that in my experience, the quality of the scraping is going to be directly proportional to how much you pay... there's not much worth anything that's FOSS or free.

Are you looking to pay for this?  Or are you looking for something free?
