Author Topic: get the text from html  (Read 9031 times)

kalos

« on: February 28, 2008, 04:15 PM »
Hello,

I have some HTML files.

Is it possible to use a program to extract only the text from the HTML? The result should be the same as opening the file in a browser, pressing Ctrl+A, and copying and pasting all the selected text.

Thanks

PS: I need to do this in batch.

Veign

« Reply #1 on: February 28, 2008, 04:24 PM »
Online tool:
http://www.zubrag.co...ml-tags-stripper.php

Probably uses a PHP function:
http://us2.php.net/strip-tags
(I bet someone could write a batch upload tool for this. It would probably take me an hour, and if I had the time I would do it.)

Possibly automate with HTML Tidy:
http://tidy.sourcefo...t/docs/Overview.html
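
For anyone who wants to try the tag-stripping idea locally, here is a minimal Python sketch. It is not PHP's strip_tags implementation, just a rough regex-based stand-in, and the name strip_tags_naive is made up for illustration. It assumes reasonably well-formed markup and will mangle anything with a ">" inside an attribute value, a comment, or a script.

```python
import html
import re

# Matches anything that looks like a tag: "<", then one or more non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_naive(markup):
    """Remove tag-like substrings, then decode entities such as &amp;."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


print(strip_tags_naive("<p>Hello <b>world</b> &amp; all</p>"))
```

Note that this keeps the contents of script and style elements, so it is only a fit for simple pages.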

urlwolf

« Reply #2 on: February 28, 2008, 04:35 PM »
There are many ways to do this in batch.
(1) Regular expressions. Not the best way; very error-prone if the HTML is malformed.

(2) Parsing the HTML with a specialized parser, e.g., Perl's HTML::Tree or Ruby's REXML. More accurate, but still a pain for a simple plain-text dump.

(3) (Recommended) Use a text-only browser (e.g., lynx, links). Feed the whole set of files to, say, lynx. It has an option to write the rendered page out as plain text (-dump). This is the easiest and most robust.

You need to download lynx from: lynx.browser.org/

HTH
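
To make option (2) concrete, here is a small sketch using Python's standard-library html.parser instead of the Perl or Ruby parsers mentioned above (that substitution is mine). The names TextExtractor and html_to_text are hypothetical. It collects text nodes and skips the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the text nodes of the markup, one per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

One limitation of this sketch: every text node goes on its own line, so inline markup such as bold text splits a sentence across lines. A browser or lynx lays the text out more faithfully.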

Renegade

« Reply #3 on: February 29, 2008, 09:29 AM »
http://www.4guysfrom...btech/042501-1.shtml

http://weblogs.asp.n...2003/05/13/6963.aspx

http://www.databasej.../article.php/3494196

There are more, of course, but those are technical.

The easy way is to copy to the clipboard, then paste as text. That can be automated as well.

kalos

« Reply #4 on: February 29, 2008, 10:06 AM »
Quote from Renegade: "The easy way is to copy to the clipboard, then paste as text. That can be automated as well."

This is what I want.

As for lynx, I am struggling to learn how to use it.
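
In case it helps with lynx: the text dump is a single command, lynx -dump page.html > page.txt. Below is a hedged Python sketch for driving it over many files. It assumes lynx is installed and on the PATH; the helper names lynx_command and dump_with_lynx are made up. The -nolist flag suppresses the numbered list of links that lynx otherwise appends to the dump.

```python
import shutil
import subprocess


def lynx_command(html_path):
    """Build the argument list for a plain-text dump of one local file."""
    return ["lynx", "-dump", "-nolist", str(html_path)]


def dump_with_lynx(html_path):
    """Run lynx on one file and return the rendered text."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(lynx_command(html_path), capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


print(lynx_command("page.html"))
```

Calling dump_with_lynx in a loop over your files and writing each result to a .txt file gives you the batch run.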

kalos

« Reply #5 on: July 09, 2008, 09:49 AM »
Is there any update on this?

I am having a hard time doing it.

It should also support multiple encodings.
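
On the encoding point, one possible approach (a sketch, not a complete charset detector) is to read each file as bytes, try the charset the page declares in its meta tag, and then fall back through a short list. The function name decode_html and the fallback order are assumptions for illustration.

```python
import re

# Looks for charset=... in a <meta> tag near the top of the file.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_html(raw, fallbacks=("utf-8", "cp1252")):
    """Decode raw HTML bytes: declared charset first, then the fallbacks."""
    candidates = []
    match = CHARSET_RE.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return raw.decode("latin-1")


print(decode_html("café".encode("cp1252")))
```

A page that declares the wrong charset can still decode without an error and come out garbled, so spot-check a few converted files.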

kimmchii

« Reply #6 on: July 10, 2008, 09:53 PM »
You can also merge/join the HTML files, then copy and paste everything at once.

Just look for a freeware merge/join program.

Target

« Reply #7 on: July 10, 2008, 11:04 PM »
Not automated, but...

Look at some of the HTML editors (and possibly some of the better text editors?). A lot of them have functions that strip out the tags, leaving you with a clean text file.

Curt

« Reply #8 on: July 11, 2008, 06:53 AM »
How can it be done in batch if he needs only the selected text? If I misunderstood this, and the text doesn't have to be selected, but the need is to convert a number of homepages to text, then just google HTML2TXT; there are plenty to choose from.

This one is quite old:

What is HTML2TXT

HTML2TXT lets you convert HTML files to TXT format. It removes the HTML tags and also reformats the text, based on the web site layout. You can convert multiple files or entire folders in a single run.

freeware.
http://www.bobsoft.com/html2txt/
« Last Edit: July 11, 2008, 04:20 PM by Curt »
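
If you would rather script the "entire folder in a single run" part yourself, here is a rough Python sketch of a batch driver. It has nothing to do with how HTML2TXT works internally. The function name convert_folder is hypothetical, and the convert argument is whatever HTML-to-text function you settle on.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert, encoding="utf-8"):
    """Write a .txt file into dst_dir for every .htm/.html file in src_dir."""
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src.glob("*.htm*")):
        # Assumption: one input encoding for all files; bad bytes are replaced.
        markup = html_path.read_text(encoding=encoding, errors="replace")
        out_path = dst / (html_path.stem + ".txt")
        out_path.write_text(convert(markup), encoding="utf-8")
        written.append(out_path)
    return written
```

For example, convert_folder("pages", "text", my_converter) would read pages/*.html and write text/*.txt, where my_converter stands for any function that takes an HTML string and returns plain text.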