Author Topic: a program to batch-convert html to plain text?  (Read 8616 times)

urlwolf

  • Charter Member
  • Joined in 2006
  • ***
  • Posts: 1,837
a program to batch-convert html to plain text?
« on: May 06, 2006, 01:20 PM »
This is something I needed a long time ago and now need again.
I wish there were a program to batch-convert HTML to text in an easy way,
stripping tables, ads, sitemaps, frames, etc. and keeping only the article text.

Since people use HTML in hugely varying ways and each site has its own
formatting, a general solution is difficult, I guess. I used to use Perl plus
some parsing modules, but those solutions would break easily (a bit more
robust than simple regular expressions, but still).
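
As a rough illustration of the parser-based approach (in Python rather than the Perl modules mentioned above), here is a minimal sketch built on the standard library's `html.parser`. The names `SKIPPED`, `TextOnly` and `html_to_text` are made up for this example, and the set of skipped elements is only an assumption about what counts as clutter; it does not try to detect ads or sitemaps.

```python
from html.parser import HTMLParser

# Elements whose contents are dropped entirely (an illustrative choice,
# not a general rule for finding "article text").
SKIPPED = {"script", "style", "table", "nav", "iframe", "noscript"}


class TextOnly(HTMLParser):
    """Collects text that is not nested inside a skipped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # > 0 while inside at least one skipped element
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the kept text fragments joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Because a real parser tracks nesting, this tends to survive markup that trips up regular expressions, but it still flattens everything into one line and will insert a space wherever an inline tag splits a word.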

Then, I think I moved to using dumps from text-based browsers. 'Links' actually
supports frames, which is nice. The most popular one is 'lynx'.

http://links.sourcef...t/download/binaries/

Still, it is a lot of work and it needs fine-tuning.

Do you know of any program that offers that, together with a GUI and some other
niceties? There must be one out there...
Thanks a lot

369

  • Participant
  • Joined in 2006
  • *
  • Posts: 8

gjehle

  • Member
  • Joined in 2006
  • **
  • Posts: 286
  • lonesome linux warrior
Re: a program to batch-convert html to plain text?
« Reply #2 on: May 06, 2006, 03:20 PM »
maybe try this one
http://html2text.sourceforge.net/

besides there should be enough perl modules too

and yes, links2 does it too

easy to get a batch script using those
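
A batch script along those lines could look roughly like the sketch below. It assumes a converter that takes an HTML file as its last argument and prints plain text to standard output; the default shown is `lynx -dump -nolist`, and the function name `convert_tree` and its parameters are invented for this example. Any other command-line converter with the same calling convention could be passed in instead.

```python
import subprocess
from pathlib import Path

# Assumed default: lynx's -dump prints the rendered page to stdout and
# -nolist leaves out the trailing list of link URLs.
LYNX_DUMP = ["lynx", "-dump", "-nolist"]


def convert_tree(src_dir, out_dir, command=LYNX_DUMP):
    """Run `command <file>` on every .htm/.html file under src_dir and
    save the captured output as a .txt file under out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    written = []
    for html_file in sorted(src_dir.rglob("*.htm*")):
        # Mirror the source layout, swapping the extension for .txt.
        target = out_dir / html_file.relative_to(src_dir).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(
            [*command, str(html_file)], capture_output=True, check=True
        )
        target.write_bytes(result.stdout)
        written.append(target)
    return written
```

For example, `convert_tree("saved_pages", "plain_text")` would walk a hypothetical `saved_pages` folder. With `check=True` the run stops at the first file the converter fails on, which is a deliberate choice here so that errors are not silently skipped.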

tinyvillager

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 444
Re: a program to batch-convert html to plain text?
« Reply #3 on: May 06, 2006, 04:01 PM »
Source: http://www.highdots.com/html-code-export/


"Copy and Paste"


Looking for a simple and fast way to indent and export your HTML code into various file formats? Look no further than HTML Code Export, a unique and easy to use software to quickly and easily reindent, export (10+ formats supported) and print your HTML documents, convert them to PDF, RTF, images and more!

Convert your HTML code to the following formats:
HTML
PDF
RTF
BMP
PNG
JPG
Lotus
SVG
QUATTRO Pro
Excel


It's freeware too. :Thmbsup:

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,900
Re: a program to batch-convert html to plain text?
« Reply #4 on: May 06, 2006, 05:05 PM »
nice find tinyvillager!
other good software from that company as well: http://www.highdots.com/

( sending a couple credits your way :) )

urlwolf

Re: a program to batch-convert html to plain text?
« Reply #5 on: May 06, 2006, 08:26 PM »
Thanks for the answers.
Here is what I have found after a brief test.
web2text was last updated... exactly at the end of the last century. It leaves
tags behind and could not produce passable output from the HTML that I used.

html2text is written in Visual Basic (of all possible languages, that would be
my LAST option for this job!) and is really slow. I can actually see a progress
bar moving when the conversion should be instantaneous. It is not suited for
batch processing, but it comes with source code, so that could be fixed. The
output is OK, not great. For example, it misses paragraph separators and
returns a compact block of text. Not ready for prime time.
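
To show what keeping paragraph separators might involve, here is a small sketch using Python's standard `html.parser`. It is not how html2text works; the names `BLOCK_TAGS`, `ParagraphText` and `html_to_paragraphs` are invented for the example, and the list of block-level tags is an assumption that real pages may need extended.

```python
import re
from html.parser import HTMLParser

# Tags assumed to start or end a paragraph. <br> is treated as a full
# paragraph break here purely to keep the sketch short.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphText(HTMLParser):
    """Collects text, marking block-level tag boundaries with a blank line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse source whitespace so that only block tags create breaks.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_paragraphs(html):
    """Return the text with one blank line between paragraphs."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    paragraphs = (p.strip() for p in "".join(parser.parts).split("\n\n"))
    return "\n\n".join(p for p in paragraphs if p)
```

The idea is simply that whitespace in the HTML source is meaningless, so paragraph breaks have to be derived from the tags rather than from the newlines in the file.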

HTML Code Export doesn't actually export to plain text. It looks very
professional and useful, but not for my purpose.

links (.99) for the terminal doesn't have flags to dump to text. It was lynx
that I had used, actually; I had them confused. The Windows versions that I
tried didn't work, and it looks a bit like abandonware (most sites that hosted
it are now dead links, at least for Windows).

So I still haven't found what I'm looking for :)
Thanks a lot