get the text from html

kalos:
hello

I have some html files

is it possible to use a program to extract only the text from the html? the result should be the same as opening the html in a browser, pressing ctrl+a, and copying and pasting all the selected text

thanks

PS: I need to do this in batch

Veign:
Online tool:
http://www.zubrag.com/tools/html-tags-stripper.php

Probably uses a PHP function:
http://us2.php.net/strip-tags
(I bet someone could write a batch upload tool for this. It would probably take me an hour, and if I had time I would do it.)
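
For what it's worth, here is a rough sketch of what such a batch tag stripper might look like. It is written in Python rather than PHP and uses the standard library's `html.parser` instead of `strip_tags`, so it is a swapped-in technique, not what the online tool does. The names `TextExtractor`, `html_to_text` and `convert_folder` are made up for illustration, and the UTF-8 encoding and the `.html` to `.txt` file layout are assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_folder(folder):
    """Write a .txt file next to every .html file in the folder (assumed layout)."""
    for path in Path(folder).glob("*.html"):
        html = path.read_text(encoding="utf-8", errors="replace")  # encoding is an assumption
        path.with_suffix(".txt").write_text(html_to_text(html), encoding="utf-8")
```

Calling `convert_folder("pages")` on a hypothetical `pages` folder would process every `.html` file in it. The output will not match a browser's copy and paste exactly, since no layout is rendered.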

Possibly automate with HTML Tidy:
http://tidy.sourceforge.net/docs/Overview.html

urlwolf:
There are many ways to do this in batch.
(1) Regular expressions. Not the best way; very prone to error if the html is malformed.

(2) Parsing the html with a specialized parser, e.g., Perl's HTML::Tree or Ruby's REXML. More accurate, but still a pain for a simple plain text dump.

(3) (recommended) Use a text-only browser (e.g., lynx, links). Run each file through, say, lynx. Its -dump option writes the rendered page out as plain text. This is the easiest and most robust.

You need to download lynx from: lynx.browser.org/
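
A minimal sketch of the batch run, assuming lynx is installed and on the PATH. It uses lynx's `-dump` option (print the rendered page to stdout) together with `-nolist` (leave out the numbered link list at the end). The Python wrapper, the function names `lynx_command` and `dump_folder`, and the choice to write a `.txt` file beside each `.html` file are my own assumptions.

```python
import subprocess
from pathlib import Path


def lynx_command(html_path):
    # -dump prints the rendered page to stdout; -nolist omits the trailing link list.
    return ["lynx", "-dump", "-nolist", str(html_path)]


def dump_folder(folder):
    """Run lynx once per .html file and save its output as a .txt file beside it."""
    written = []
    for path in sorted(Path(folder).glob("*.html")):
        # check=True raises CalledProcessError if lynx exits with an error.
        result = subprocess.run(lynx_command(path), capture_output=True, check=True)
        out = path.with_suffix(".txt")
        out.write_bytes(result.stdout)  # keep lynx's output bytes as-is
        written.append(out)
    return written
```

`dump_folder("pages")` on a hypothetical `pages` folder returns the list of text files it wrote. If lynx is not installed, `subprocess.run` raises `FileNotFoundError` on the first file.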

HTH

Renegade:
http://www.4guysfromrolla.com/webtech/042501-1.shtml

http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx

http://www.databasejournal.com/scripts/article.php/3494196

There are more, of course, but those are technical.

The easy way is to copy to the clipboard, then paste as text. That can be automated as well.

kalos:
Quote from Renegade (February 29, 2008, 09:29 AM):
The easy way is to copy to the clipboard, then paste as text. That can be automated as well.
--- End quote ---

this is what I want

as for lynx, I am struggling to learn to use it
