Topic: how to download the text of webpages?
kalos
« on: November 20, 2013, 09:50:41 AM »

Hello,

I want to download/save the text of a webpage as it appears when I open it in a web browser, and to automate this, as I want to do it for hundreds of webpages.

Unfortunately, trying to download the webpage results in saving the page's source rather than its rendered text.

for example:
http://www.exportfoodandd...rofiles/688-abc-nutrition

The text of this webpage contains the email address info@abcsportsnutrition.com, but if you try to save the webpage and open the downloaded file, this email is NOT present (the page has to run some JavaScript to display it, in order to defeat spambots).
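For illustration only (this snippet is not from the original post), a plain HTTP fetch shows the gap: it returns only the page source, and the JavaScript that would insert the address never runs. The URL below is a made-up stand-in for the truncated link above.

import urllib.request

# Fetch the raw page source; no JavaScript is executed.
url = "http://www.example.com/profiles/688-abc-nutrition"  # hypothetical stand-in URL
source = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# The address is inserted client-side, so it is absent from the raw source.
print("info@abcsportsnutrition.com" in source)  # expected: False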

Any idea how to overcome this?

Thanks
4wd
« Reply #1 on: November 20, 2013, 08:13:06 PM »

Control+A then Control+S, Save as Text ... works for me in Pale Moon (Firefox):


Email address is there.

There's always my GreaseMonkey userscript for some outright munging of the website before you select/copy.

Other than that, any macro recorder should be able to handle it.
« Last Edit: November 20, 2013, 08:26:14 PM by 4wd »
kalos
« Reply #2 on: November 21, 2013, 06:04:22 AM »

I need to do this in an automated way: load a list of URLs and produce a corresponding text file for each one.
4wd
« Reply #3 on: November 21, 2013, 04:41:36 PM »

You'll probably need to look for a browser add-on, since whatever does the job will have to load the page, interpret it (including the JavaScript), and then save it, either as HTML or text.

I don't know of any standalone web ripper that can interpret JavaScript (to grab protected email addresses).

If the JavaScript requirement is not critical, then wget will download a list of URLs, and HTMLAsText will convert them all.

e.g.
wget.exe -i URLlist.txt -E
(-i reads the URLs from a text file; -E appends an .html extension to the saved files)
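If the JavaScript does matter, one rough possibility (a sketch, not something suggested in this thread) is to script a real browser. The example below assumes Python with the selenium package and a Firefox/geckodriver setup; the file names urls.txt and page_001.txt etc. are invented for the illustration.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()              # a real browser, so the page's JavaScript runs

with open("urls.txt") as f:               # one URL per line (hypothetical file name)
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls, start=1):
    driver.get(url)                       # load the page
    time.sleep(2)                         # crude wait for any late-running scripts
    text = driver.find_element(By.TAG_NAME, "body").text   # rendered, visible text only
    with open("page_%03d.txt" % i, "w", encoding="utf-8") as out:
        out.write(text)

driver.quit()

Each URL in the list then ends up as one plain-text file containing the page as the browser rendered it, JavaScript-generated email addresses included.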
wraith808
« Reply #4 on: November 21, 2013, 04:51:41 PM »

There's software for this type of thing (I've evaluated and made enough of them for work).

What you're looking for is a website scraper.  I will say that in my experience, the quality of the scraping is going to be directly proportional to how much you pay... there's not much worth anything that's FOSS or free.

Are you looking to pay for this?  Or are you looking for something free?
kalos
« Reply #5 on: November 22, 2013, 04:31:34 AM »

Quote from: wraith808
Are you looking to pay for this?  Or are you looking for something free?

Something free would be ideal, but if I have to pay a reasonable amount, it would be okay.
patteo
« Reply #6 on: November 25, 2013, 02:00:29 AM »

Quote from: wraith808
Are you looking to pay for this?  Or are you looking for something free?
Quote from: kalos
Something free would be ideal, but if I have to pay a reasonable amount, it would be okay.

Browser Scripting, Data Extraction and Web Testing by iMacros
Use iMacros® 9 to create solutions for web automation, web scraping or web testing in just five minutes.
http://www.iopus.com/imacros/

There's a free Firefox add-on:
https://addons.mozilla.or...ddon/imacros-for-firefox/

and a Chrome add-on:
https://chrome.google.com...jogncfgfijoopmnlemp?hl=en
wraith808
« Reply #7 on: November 25, 2013, 12:49:57 PM »

Mozenda was favorably reviewed in our trials, and it has a free account to try it out with (which might be good for low throughput).  We still actually use it for some stuff.

iRobotSoft is free... it was nowhere near what we needed for our uses, but it might work for you.

Automation Anywhere - again, it didn't serve our purposes, but it is feature-rich.  We still actually use it for some low-intensity things.
IainB
« Reply #8 on: November 25, 2013, 03:06:23 PM »

Consider Scrapbook: I'm not sure if it is exactly what you want, but the Firefox Scrapbook extension is quite good at scraping individual web pages and the pages nested underneath them. You can tell it how "deep" to go into the nested site, and which files to pick up or ignore as it goes. It tidies all the links up when done, so you have a relatively self-contained copy of the downloaded pages.
I have highlighted in the screenshot below (using red boxes and an arrow) the relevant bits of the download options, which pop up on a drag-and-drop save to a Scrapbook folder:



Scrapbook is useful where you want to be able to search/retrieve the content easily. Not only can you use Scrapbook to index/search everything it downloads, but also, since it saves the content in a non-proprietary format (HTML), you can index/search it with the standard Windows Desktop Search and - in my case - xplorer² (a Windows Explorer replacement tool). Any files it creates or downloads also show up immediately in the file/folder search tool Everything.
« Last Edit: November 25, 2013, 03:13:07 PM by IainB; Reason: Added note about Everything. »
IainB
« Reply #9 on: November 25, 2013, 06:06:01 PM »

You could use the Mozilla Archive Format extension to save individual pages, or several/all tabs, in either MAFF or MHTML format.
Depending on the constraints, using search/index tools on these files may be problematic as:
  • The MAFF format files are based on the .ZIP format.
  • The MHTML format files are based on the MIME format.
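To make that concrete, here is a rough sketch (mine, not from this thread) showing that the HTML can still be pulled back out of both formats with Python's standard library, for example to feed an indexer. The file names saved_page.maff and saved_page.mht are invented for the example.

import zipfile
import email

# MAFF: a ZIP archive containing <folder>/index.html plus supporting files.
with zipfile.ZipFile("saved_page.maff") as zf:
    for name in zf.namelist():
        if name.endswith("index.html"):
            html = zf.read(name).decode("utf-8", errors="replace")
            print(name, "->", len(html), "characters of HTML")

# MHTML: a MIME multipart message; the first text/html part is the page itself.
with open("saved_page.mht", "rb") as f:
    msg = email.message_from_binary_file(f)
for part in msg.walk():
    if part.get_content_type() == "text/html":
        html = part.get_payload(decode=True).decode("utf-8", errors="replace")
        print(len(html), "characters of HTML")
        break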



When opened, these files may not be as "exact" a copy as (say) the Scrapbooked pages, and the result may also differ from browser to browser.
By the way, note that the Scrapbook extension currently only works in Firefox.
tomos
« Reply #10 on: November 26, 2013, 05:20:50 AM »

Another possibility, here on DC:
Tip: Fast saving of html pages in Firefox by renaming them with a short name
wraith808
« Reply #11 on: November 26, 2013, 11:57:39 AM »

A lot of this depends on use.  If you're downloading the pages to view them as-is, then these last methods will work.  But since the OP said they want to download the text of the web pages in bulk, I'd assume it isn't for browsing, but rather for the data.  Perhaps that clarification will help to focus the conversation on the appropriate software.
kalos
« Reply #12 on: November 26, 2013, 01:07:26 PM »

Quote from: wraith808
Since the OP said they want to download the text of the web pages in bulk, I'd assume it isn't for browsing, but rather for the data.

Yes, it was for the data.