Author Topic: HTML to XHTML conversion, what's most effective?  (Read 3770 times)

tranglos

HTML to XHTML conversion, what's most effective?
« on: October 02, 2009, 12:17 PM »
What would be the most effective and free solution to convert a bunch of html 4.0 files to pure xhtml? Ideally without having to open each file individually in some editor? (Not an online service either, because it will take ages to do for a lot of files.)

I've been trying HTML Tidy, but I can't figure out how to make it output utf-8 or even ISO codepage characters. Whatever settings I try, it insists on outputting diacritic ("high ascii") characters as entities, such as &#123;.

tinjaw

Re: HTML to XHTML conversion, what's most effective?
« Reply #1 on: October 02, 2009, 02:31 PM »
Well, I was going to suggest tidy.  :-[

I still think it is your best choice, and you should probably just see if Google can get you the correct command line voodoo you need.

tranglos

Re: HTML to XHTML conversion, what's most effective?
« Reply #2 on: October 02, 2009, 04:07 PM »
Tidy has all sorts of options for input and output encoding, yet it always gives me entities. And among the software I have, TopStyle, CSE Validator and HTMLPad, all can do the conversion, and all rely on Tidy :)

It's off to Google then!

tranglos

Re: HTML to XHTML conversion, what's most effective?
« Reply #3 on: October 02, 2009, 08:40 PM »
Ah. Case solved. Unless it's utf-8, Tidy only speaks English. No matter that there are perfectly valid encodings such as the iso-8859-* line. Tidy will either output entities or do the worst thing imaginable: reduce é to e, ą to a, ź to z... etc, for any character above ascii 127.

Tidy should come with a special warning label. It can make a grown linguist cry :'(

The problem was actually compounded by what may be a bug in TopStyle 4. Tidy does the right thing if input is utf-8, which TopStyle officially supports now. Yet for some reason when data comes back from Tidy to TopStyle, you get "raw" utf-8, also known as garbage. If you save the changes, it's search and replace next. But it's not Tidy's fault, since HTMLPad 2008 manages to get utf-8 there and back cleanly.

Now I only need to convert a batch of files from iso-8859-2 to utf-8, *and* remove the meta charset declaration from all the files first, yay! :)

(Moral of the story: someone who speaks Python or Perl could probably achieve in three minutes what's taken me half a day already.)
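For reference, the batch step described above (iso-8859-2 to utf-8, dropping the meta charset declaration) could look roughly like this in Python. This is only a sketch under assumptions: the function names `to_utf8` and `convert_tree` and the `*.html` pattern are made up for illustration, every file is assumed to really be iso-8859-2, and the regex only catches `<meta ...>` tags that mention `charset=`.

```python
import re
from pathlib import Path

# Matches a whole <meta ...> tag that declares a charset, plus the line break
# after it. Assumption: the declaration sits in a single tag without a '>'
# inside its attribute values.
META_CHARSET_RE = re.compile(
    r"<meta\b[^>]*\bcharset\s*=[^>]*>[ \t]*\r?\n?", re.IGNORECASE
)


def to_utf8(raw: bytes, source_encoding: str = "iso-8859-2") -> bytes:
    """Decode bytes from source_encoding, drop the meta charset tag, re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    text = META_CHARSET_RE.sub("", text)
    return text.encode("utf-8")


def convert_tree(root: str, pattern: str = "*.html",
                 source_encoding: str = "iso-8859-2") -> int:
    """Rewrite every matching file under root in place; return the number of files touched."""
    count = 0
    for path in Path(root).rglob(pattern):
        path.write_bytes(to_utf8(path.read_bytes(), source_encoding))
        count += 1
    return count
```

Run it once, on a copy of the files: a second pass would decode the fresh utf-8 bytes as iso-8859-2 again and produce garbage. After this step the files can go through Tidy with utf-8 on both sides.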

housetier

Re: HTML to XHTML conversion, what's most effective?
« Reply #4 on: October 02, 2009, 08:51 PM »
From the man page and "tidy -show-config" I would suggest trying these command-line options:

-asxhtml convert HTML to well formed XHTML
-output-encoding utf8

(Maybe the output-encoding option has to be written in a config file, I am not sure)
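To apply these options to many files without opening each one, Tidy can be called from a small script. A rough sketch, assuming the `tidy` executable is on the PATH and the files are already utf-8; `tidy_command` and `tidy_tree` are hypothetical helper names, and `-utf8` (documented as setting both input and output encoding to UTF-8) is used here in place of the output-encoding option:

```python
import subprocess
from pathlib import Path


def tidy_command(path: str) -> list[str]:
    """Build a Tidy command line that converts one file to XHTML in place."""
    return [
        "tidy",
        "-asxhtml",  # convert HTML to well formed XHTML
        "-utf8",     # treat both input and output as UTF-8
        "-quiet",    # suppress nonessential output
        "-modify",   # overwrite the original file
        path,
    ]


def tidy_tree(root: str, pattern: str = "*.html") -> list[str]:
    """Run Tidy on every matching file under root; return the files Tidy reported errors for."""
    failed = []
    for path in Path(root).rglob(pattern):
        result = subprocess.run(tidy_command(str(path)))
        # Tidy exits with 0 on success, 1 on warnings, 2 on errors.
        if result.returncode >= 2:
            failed.append(str(path))
    return failed
```

Since `-modify` overwrites the files, it is safer to try this on a copy of the directory first.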

tranglos

Re: HTML to XHTML conversion, what's most effective?
« Reply #5 on: October 02, 2009, 09:18 PM »
-asxhtml convert HTML to well formed XHTML
-output-encoding utf8

Tidy will output utf-8 only if input is utf-8 as well, or - if I understand the docs correctly - if input is one of the charsets it supports, that is ascii, Latin1, iso-8859-1 and a few more exotic ones. Other input charsets are explicitly documented as unsupported: "Tidy will accept vendor specific character values, but will use entities for all characters whose value > 127": http://tidy.sourcefo...f.html#char-encoding

(I'm only surprised that for non-Western charsets the "raw" switch doesn't do what you'd think it would. The doc referenced above reads "For raw, Tidy will output values above 127 without translating them into entities.", but that's not what happened. Tidy "normalized" everything above 127 to the 0-127 range, thus corrupting data, if input was iso-8859-2 or Windows-1250, i.e. Central European. Seems like utf-8 on both sides of Tidy is the only way to go.)
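To illustrate why a charset Tidy doesn't know about can't survive the round trip: the same byte means different characters in iso-8859-2 and Latin1, so a tool that assumes the wrong one either misreads the character or falls back to entities. A small Python sketch (the helper names `reinterpret` and `as_entities` are made up for the example, and it only illustrates the byte-level issue, not Tidy's actual internals):

```python
def reinterpret(text: str, actual: str = "iso-8859-2", assumed: str = "latin-1") -> str:
    """Encode text in its real charset, then decode the bytes as if they were another one."""
    return text.encode(actual).decode(assumed)


def as_entities(text: str) -> str:
    """Write every character above ASCII 127 as a numeric entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    sample = "éąź"
    # é is byte 0xE9 in both charsets, but ą (0xB1) and ź (0xBC) are not the same in Latin1.
    print(reinterpret(sample))   # é±¼
    print(as_entities(sample))   # &#233;&#261;&#378;
```

Only é comes through intact, because it happens to share a byte value in both charsets.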

I've got it now though ;)
« Last Edit: October 02, 2009, 09:20 PM by tranglos »