Author Topic: HTML to XHTML conversion, what's most effective? (Read 4876 times)

tranglos · « **on:** October 02, 2009, 12:17 PM »

What would be the most effective and free solution to convert a bunch of html 4.0 files to pure xhtml? Ideally without having to open each file individually in some editor? (Not an online service either, because it will take ages to do for a lot of files.)

I've been trying HTML Tidy, but I can't figure out how to make it output utf-8 or even ISO codepage characters. Whatever settings I try, it insists on outputting diacritic ("high ascii") characters as entities, such as &#123;.

tinjaw · « **Reply #1 on:** October 02, 2009, 02:31 PM »

Well, I was going to suggest tidy.

I still think it is your best choice, and you should probably just see if Google can get you the correct command-line voodoo you need.

tranglos · « **Reply #2 on:** October 02, 2009, 04:07 PM »

Tidy has all sorts of options for input and output encoding, yet it always gives me entities. And among the software I have, TopStyle, CSE Validator and HTMLPad can all do the conversion, and all rely on Tidy.

It's off to Google then!

tranglos · « **Reply #3 on:** October 02, 2009, 08:40 PM »

Ah. Case solved. Unless it's utf-8, Tidy only speaks English. No matter that there are perfectly valid encodings such as the iso-8859-* line. Tidy will either output entities or do the worst thing imaginable: reduce é to e, ą to a, ź to z... etc, for any character above ascii 127.

Tidy should come with a special warning label. It can make a grown linguist cry.

The problem was actually compounded by what may be a bug in TopStyle 4. Tidy does the right thing if input is utf-8, which TopStyle officially supports now. Yet for some reason when data comes back from Tidy to TopStyle, you get "raw" utf-8, also known as garbage. If you save the changes, it's search and replace next. But it's not Tidy's fault, since HTMLPad 2008 manages to get utf-8 there and back cleanly.

Now I only need to convert a batch of files from iso-8859-2 to utf-8, *and* remove the meta charset declaration from all the files first, yay!

(Moral of the story: someone who speaks Python or Perl could probably achieve in three minutes what's taken me half a day already.)
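For the batch step described above (re-encode from iso-8859-2 to utf-8 and drop the meta charset declaration), a minimal Python sketch might look like the following. It is only an illustration, not what was actually used in this thread: the function names, the `*.html` pattern and the regular expression are assumptions, and the regex removes any `<meta>` tag that contains `charset=`.

```python
import re
from pathlib import Path

# Assumption: the charset is declared HTML 4 style, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
META_CHARSET_RE = re.compile(r"<meta\b[^>]*\bcharset\s*=[^>]*>\s*", re.IGNORECASE)


def reencode_html(raw, source_encoding="iso-8859-2"):
    """Decode bytes from the source charset, drop the meta charset tag, return UTF-8 bytes."""
    text = raw.decode(source_encoding)
    text = META_CHARSET_RE.sub("", text)
    return text.encode("utf-8")


def convert_tree(folder, pattern="*.html"):
    """Rewrite every matching file under `folder` in place; return how many were converted."""
    count = 0
    for path in Path(folder).rglob(pattern):
        path.write_bytes(reencode_html(path.read_bytes()))
        count += 1
    return count
```

Calling `convert_tree("site")` (a hypothetical folder name) would rewrite the files in place, so it is worth running on a copy first. Running it twice on the same files would corrupt them, because the second pass would decode utf-8 bytes as if they were iso-8859-2.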

housetier · « **Reply #4 on:** October 02, 2009, 08:51 PM »

From the man page and "tidy -show-config" I would suggest trying these command-line options:

-asxhtml convert HTML to well formed XHTML
-output-encoding utf8

(Maybe the output-encoding option has to be written in a config file, I am not sure)
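To avoid opening each file individually, the Tidy call can be scripted. The sketch below is one hedged way to do it in Python. It assumes a `tidy` executable is on the PATH and uses the shorthand flags `-asxhtml`, `-utf8` (utf-8 for both input and output), `-q` (quiet) and `-m` (modify the file in place); the helper names and the `*.html` pattern are made up for illustration.

```python
import subprocess
from pathlib import Path


def tidy_command(path):
    """Build the Tidy command line for one file (assumes `tidy` is on the PATH)."""
    # -asxhtml: convert to XHTML; -utf8: UTF-8 for input and output;
    # -q: suppress nonessential output; -m: write the result back to the file.
    return ["tidy", "-asxhtml", "-utf8", "-q", "-m", str(path)]


def tidy_folder(folder, pattern="*.html"):
    """Run Tidy on every matching file; return the paths Tidy reported errors for."""
    failed = []
    for path in sorted(Path(folder).rglob(pattern)):
        result = subprocess.run(tidy_command(path), capture_output=True)
        # Tidy exits with 0 on success, 1 on warnings, 2 on errors.
        if result.returncode >= 2:
            failed.append(str(path))
    return failed
```

Because `-utf8` applies to the input as well, the files would need to be utf-8 already before this runs.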

tranglos · « **Reply #5 on:** October 02, 2009, 09:18 PM »

-asxhtml convert HTML to well formed XHTML
-output-encoding utf8
-housetier (October 02, 2009, 08:51 PM)

Tidy will output utf-8 only if input is utf-8 as well, or - if I understand the docs correctly - if input is one of the charsets it supports, that is ascii, Latin1 (iso-8859-1) and a few more exotic ones. Other input charsets are explicitly documented as unsupported: "Tidy will accept vendor specific character values, but will use entities for all characters whose value > 127": http://tidy.sourcefo...f.html#char-encoding

(I'm only surprised that for non-Western charsets the "raw" switch doesn't do what you'd think it would. The doc referenced above reads "For raw, Tidy will output values above 127 without translating them into entities.", but that's not what happened. Tidy "normalized" everything above 127 to the 0-127 range, thus corrupting data, if input was iso-8859-2 or Windows-1250, i.e. Central European. Seems like utf-8 on both sides of Tidy is the only way to go.)

I've got it now though.
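For files that have already been through Tidy and came out full of numeric entities, one possible clean-up (not something proposed in this thread, just a sketch) is to turn the references above 127 back into literal characters before saving as utf-8. The function name and regex below are illustrative assumptions. Named entities such as `&eacute;` are not handled, and ASCII references such as `&#38;` are deliberately left alone so that escaped markup stays escaped.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
NUMERIC_REF_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def unescape_high_refs(text):
    """Replace numeric character references above 127 with the characters they stand for."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (e.g. &#38; for "&") and invalid code points unchanged.
        if code <= 127 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_REF_RE.sub(replace, text)
```

The result is a Python string, so it still has to be written out with an explicit utf-8 encoding.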
