Author Topic: get the text from html (Read 10663 times)

kalos · « **on:** February 28, 2008, 04:15 PM »

hello

I have some html files

is it possible to use a program to get only the text of the html? as if I opened the html in a browser, pressed ctrl+a, and then copied and pasted all the selected text

thanks

PS: I need to do this in batch

Veign · « **Reply #1 on:** February 28, 2008, 04:24 PM »

Online tool:
http://www.zubrag.co...ml-tags-stripper.php

Probably uses a PHP function:
http://us2.php.net/strip-tags
(I bet some could write a batch upload tool for this - would probably take me an hour and if I had time I would do it)

Possibly automate with HTML Tidy:
http://tidy.sourcefo...t/docs/Overview.html

urlwolf · « **Reply #2 on:** February 28, 2008, 04:35 PM »

there are many ways to do this in batch.
(1) regular expression. Not the best way; very prone to error if html is malformed.

(2) Parsing the html with a specialized parser, e.g., perl's HTML::Tree or Ruby's REXML. More accurate, but still a pain for a simple plain text dump.

(3) (recommended) use a text-only browser (e.g., lynx, links). Run lynx over each file in the set. It has an option to dump the rendered page as plain text (-dump). This is the easiest and most robust.

You need to download lynx from: lynx.browser.org/

HTH
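For option (2), here is a rough sketch in Python using only the standard library's `html.parser`. It is not from any of the tools linked in this thread; the names `TextExtractor`, `html_to_text` and `convert_folder` are made up for illustration, and it assumes the files are UTF-8 and end in `.html`. It skips `<script>` and `<style>` contents and writes a `.txt` file next to each input file.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_folder(folder):
    """Write a .txt file next to every .html file in the folder (assumed UTF-8)."""
    for path in Path(folder).glob("*.html"):
        html = path.read_text(encoding="utf-8", errors="replace")
        path.with_suffix(".txt").write_text(html_to_text(html), encoding="utf-8")
```

Calling `convert_folder("pages")` would process a hypothetical `pages` folder in one go. The output will not match a browser's copy and paste exactly, since this does not reproduce the page layout the way lynx does.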

Renegade · « **Reply #3 on:** February 29, 2008, 09:29 AM »

http://www.4guysfrom...btech/042501-1.shtml

http://weblogs.asp.n...2003/05/13/6963.aspx

http://www.databasej.../article.php/3494196

There are more of course. But those are technical.

The easy way is to copy to the clipboard, then paste as text. That can be automated as well.

kalos · « **Reply #4 on:** February 29, 2008, 10:06 AM »

The easy way is to copy to the clipboard, then paste as text. That can be automated as well.
-Renegade (February 29, 2008, 09:29 AM)

this is what I want

as for lynx, I am struggling to learn to use it

kalos · « **Reply #5 on:** July 09, 2008, 09:49 AM »

is there any update on this function?

I am having a hard time doing it

and it should support multiple encodings
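On the encoding point, one possible approach is to read each file as bytes and try a list of candidate encodings in order before stripping the tags. This is only a sketch: `decode_with_fallback` and `read_html` are hypothetical names, and the default encoding list is a guess, since the thread does not say which encodings the files use. Order matters, because single-byte encodings such as cp1252 accept almost any input and so should come last.

```python
from pathlib import Path


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-7", "cp1252")):
    """Decode bytes with the first encoding in the list that does not fail."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode as UTF-8 and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def read_html(path, encodings=("utf-8", "iso-8859-7", "cp1252")):
    """Read a file as bytes and decode it using the candidate encodings."""
    return decode_with_fallback(Path(path).read_bytes(), encodings)
```

This does not detect the encoding declared in the page's own `<meta>` tag; it only guesses by trial, so a wrong but "successful" decode is still possible.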

kimmchii · « **Reply #6 on:** July 10, 2008, 09:53 PM »

you can also merge/join the html files then copy and paste at once.

just look for freeware merging/join prog.

Target · « **Reply #7 on:** July 10, 2008, 11:04 PM »

not automated, but....

look at some of the html editors (and possibly some of the better text editors??) - a lot of them have functions that will strip out any tags so you'll have a clean text file...

Curt · « **Reply #8 on:** July 11, 2008, 06:53 AM »

How can it be done in batch if he needs only the selected text? If I misunderstood this, and the text doesn't have to be marked, but the need is to convert a number of homepages to text, then merely google HTML2TXT - there are plenty to choose from.

This one is quite old:

What is HTML2TXT

HTML2TXT lets you convert HTML files to TXT format. It removes the HTML tags and also reformats the text, based on the web site layout. You can convert multiple files or entire folders in a single run.

freeware.
http://www.bobsoft.com/html2txt/
