HTML to XHTML conversion, what's most effective?

tranglos:
What would be the most effective and free solution to convert a bunch of html 4.0 files to pure xhtml? Ideally without having to open each file individually in some editor? (Not an online service either, because it will take ages to do for a lot of files.)

I've been trying HTML Tidy, but I can't figure out how to make it output utf-8 or even ISO codepage characters. Whatever settings I try, it insists on outputting diacritic ("high ascii") characters as numeric entities, such as &#123;.

tinjaw:
Well, I was going to suggest tidy.  :-[

I still think it is your best choice, and you should probably just see if Google can get you the correct command-line voodoo you need.

tranglos:
Tidy has all sorts of options for input and output encoding, yet it always gives me entities. And among the software I have, TopStyle, CSE Validator and HTMLPad, all can do the conversion, and all rely on Tidy :)

It's off to Google then!

tranglos:
Ah. Case solved. Unless it's utf-8, Tidy only speaks English. No matter that there are perfectly valid encodings such as the iso-8859-* line. Tidy will either output entities or do the worst thing imaginable: reduce é to e, ą to a, ź to z... etc, for any character above ascii 127.

Tidy should come with a special warning label. It can make a grown linguist cry :'(

The problem was actually compounded by what may be a bug in TopStyle 4. Tidy does the right thing if input is utf-8, which TopStyle officially supports now. Yet for some reason when data comes back from Tidy to TopStyle, you get "raw" utf-8, also known as garbage. If you save the changes, it's search and replace next. But it's not Tidy's fault, since HTMLPad 2008 manages to get utf-8 there and back cleanly.

Now I only need to convert a batch of files from iso-8859-2 to utf-8, *and* remove the meta charset declaration from all the files first, yay! :)

(Moral of the story: someone who speaks Python or Perl could probably achieve in three minutes what's taken me half a day already.)
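For anyone landing here with the same problem, the two steps above (re-encode from iso-8859-2 to utf-8, and drop the meta charset declaration) can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function names and the `*.html` pattern are made up for illustration, the regex assumes the charset sits in an ordinary `<meta ...>` tag, and the files are overwritten in place, so try it on a copy first.

```python
import re
from pathlib import Path

# Matches any <meta ...> tag that carries a charset, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
# plus the whitespace that follows it.
META_CHARSET = re.compile(r'<meta\b[^>]*\bcharset\s*=[^>]*>\s*', re.IGNORECASE)


def strip_meta_charset(html: str) -> str:
    """Remove meta tags that declare a charset (hypothetical helper)."""
    return META_CHARSET.sub("", html)


def to_utf8(raw: bytes, src_encoding: str = "iso-8859-2") -> bytes:
    """Decode bytes from the source encoding, strip the meta tag, re-encode as UTF-8."""
    text = raw.decode(src_encoding)
    return strip_meta_charset(text).encode("utf-8")


def convert_tree(root: str, pattern: str = "*.html") -> int:
    """Convert every matching file under root in place; returns the file count."""
    count = 0
    for path in Path(root).rglob(pattern):
        # Work on bytes so Python does not touch the line endings.
        path.write_bytes(to_utf8(path.read_bytes()))
        count += 1
    return count
```

Calling `convert_tree("site_copy")` would then rewrite every `.html` file under that (hypothetical) folder. Running it twice on the same files would mangle them, since the second pass would decode utf-8 bytes as iso-8859-2.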

housetier:
From the man page and "tidy -show-config" I would suggest trying these command-line options:

-asxhtml convert HTML to well formed XHTML
-output-encoding utf8

(Maybe the output-encoding option has to be written in a config file, I am not sure)
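To avoid opening each file by hand, the Tidy call can be wrapped in a small Python loop. This is a sketch, not a tested recipe: it assumes a `tidy` executable is on the PATH, that the files are already utf-8, and it uses Tidy's `-utf8` shortcut (utf-8 for both input and output) instead of the `-output-encoding` form quoted above. The helper names are invented for illustration, and `-modify` makes Tidy overwrite the originals.

```python
import subprocess
from pathlib import Path


def tidy_command(path: str) -> list[str]:
    """Build the Tidy command line for one file (hypothetical helper)."""
    return [
        "tidy",
        "-asxhtml",  # convert HTML to well-formed XHTML
        "-utf8",     # treat input and output as UTF-8
        "-quiet",    # suppress non-essential output
        "-modify",   # write the result back to the original file
        path,
    ]


def tidy_tree(root: str, pattern: str = "*.html") -> dict[str, int]:
    """Run Tidy on every matching file; map each path to Tidy's exit code."""
    results = {}
    for path in Path(root).rglob(pattern):
        proc = subprocess.run(tidy_command(str(path)), capture_output=True, text=True)
        # Tidy documents exit code 0 as clean, 1 as warnings, 2 as errors.
        results[str(path)] = proc.returncode
    return results
```

Files that come back with exit code 2 are the ones worth opening in an editor afterwards.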
