Topic: SOLVED: convert html tags/attributes to lowercase

hgondalf

  • Participant
  • Joined in 2010
  • Posts: 15
SOLVED: convert html tags/attributes to lowercase
« on: September 02, 2012, 11:59:05 AM »
The problem I need to solve is this: I have a lot of books that I have typeset in FrameMaker 5.5.6, PageMaker 7, or Word format. I need to convert them to e-book formats: mobi, epub, etc. I will be using calibre.exe for the conversion. E-book source is always in html. I want to use proper html/css coding so that the e-publications have a uniform look & feel.

All of the word processing apps have an “export to html” function. Trouble is, the output html is all upper case, e.g., <P ALIGN="center">. I like to code web pages so that they comply with xhtml 1.0 strict, which means no upper case tags or attributes, among other things. (No ALIGN attribute, for instance!)

I have in mind a little app that would rest on the desktop, that I could drag and drop a file onto. Something like the good old DOS2UNIX.exe that converts line endings from CRLF to LF. No GUI, in other words.
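
A drop target like that needs no GUI: Windows passes the path of each dropped file to the program as a command-line argument. Here is a minimal sketch in Python (not the AutoHotkey used later in this post), with a DOS2UNIX-style CRLF-to-LF conversion standing in for the real transform. The names `crlf_to_lf` and `convert_file` are made up for illustration, and dropping onto a script rather than an .exe assumes a drop handler is registered or a batch-file wrapper is used.

```python
import shutil
import sys
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Stand-in transform: DOS2UNIX-style line-ending conversion."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: Path, transform=crlf_to_lf) -> Path:
    """Back up `path` next to itself, then rewrite it in place."""
    backup = path.with_suffix(".bak")   # test-case.html -> test-case.bak
    shutil.copyfile(path, backup)       # overwrites any existing backup
    path.write_bytes(transform(path.read_bytes()))
    return backup


if __name__ == "__main__":
    # Each dropped file arrives as one entry in sys.argv[1:].
    for name in sys.argv[1:]:
        if Path(name).is_file():
            convert_file(Path(name))
```

Swapping in a different `transform` (for example one that lowercases tags) leaves the backup-and-rewrite part unchanged.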

I have started working on a little AutoHotkey app, but I'm not so skilled. (If someone would just help me with the code, I would be very grateful!) The code I have so far is copied below, but I consider it to be pseudocode rather than actual code.

BTW, this is exactly the type of app that is best written in assembler, because it's so simple. Why didn't anybody think of it before? --Google searches don't even come close. 

Logic, in brief:
Set a couple of boolean variables to false, and a pointer to 0
Scan through the file byte by byte looking for "<"   ** NOTE
If found, go into conversion mode, lowercase everything
if converting, and next character is " quote mark, temporarily stop converting
on second " quote mark, resume converting
until ">" is found
until EOF

** Since the source file is output from a word processor, all instances of “<” etc are automatically converted to their html codes &lt; etc, so any “<” encountered will have to be the start of an html tag.
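
The logic above can be sketched as a small state machine. This is an illustrative Python version, not the AutoHotkey script itself. `lowercase_markup` is a made-up name. Treating single quotes like double quotes and skipping markup that starts with `<!` (DOCTYPE, comments) are assumptions that go slightly beyond the list above.

```python
def lowercase_markup(html: str) -> str:
    """Lowercase tag and attribute names; leave quoted values and text alone."""
    out = []
    in_markup = False   # True between "<" and ">"
    quote = ""          # the quote character that opened the current value
    for i, ch in enumerate(html):
        if in_markup:
            if quote:
                if ch == quote:          # closing quote: resume converting
                    quote = ""
            elif ch in "\"'":            # opening quote: stop converting
                quote = ch
            elif ch == ">":              # end of the tag
                in_markup = False
            else:
                ch = ch.lower()
        elif ch == "<" and html[i + 1 : i + 2] != "!":
            in_markup = True
        out.append(ch)
    return "".join(out)


print(lowercase_markup('<P ALIGN="Center">Hello</P>'))  # <p align="Center">Hello</p>
```

Unquoted attribute values (as in `<TD WIDTH=100>`) are lowercased along with the names, so this only suits input where values are quoted.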

------------------------------------ code follows

; AutoHotkey v1. Lowercases tag and attribute names; quoted attribute values
; and markup starting with "<!" (DOCTYPE, comments) are left untouched.
; #MaxMem defaults to 64 MB per variable, so the input file must be smaller than that.

InFile := "test-case.html"              ; test-case.html is my test file
FileCopy, %InFile%, test-case.bak, 1    ; overwrite any existing bak file

FileRead, BigStr, %InFile%
if ErrorLevel
{
    MsgBox, Could not read %InFile%.
    ExitApp
}

InMarkup := false   ; true while between "<" and ">"
Quote := ""         ; the quote character that opened the current attribute value
Out := ""           ; converted text is built up here

Loop, Parse, BigStr                     ; no delimiter given: one character per pass
{
    TChar := A_LoopField                ; A_Index is this character's position
    if (InMarkup)
    {
        if (Quote != "")
        {
            if (TChar = Quote)          ; closing quote: resume converting
                Quote := ""
        }
        else if (TChar = """" || TChar = "'")   ; opening quote: stop converting
            Quote := TChar
        else if (TChar = ">")
            InMarkup := false
        else
            StringLower, TChar, TChar
    }
    else if (TChar = "<" && SubStr(BigStr, A_Index + 1, 1) != "!")
        InMarkup := true
    Out .= TChar
}

FileDelete, %InFile%
FileAppend, %Out%, *%InFile%            ; leading * = binary mode, line endings kept as-is
BigStr := ""                            ; free the memory
Out := ""



------------------------------------ code ends

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,290
« Reply #1 on: September 02, 2012, 01:57:08 PM »
I use PsPad for html. I just do it for my site. I'm not a web programmer. It includes Html Tidy. But there's a command line version that may be good for batch.

http://tidy.sourceforge.net/#binaries

I've only used Tidy on one file at a time inside PsPad. There may be settings to clean all pages in a project. But I'm not sure.
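
For batch use, the command line version can be driven from a script. This is a rough sketch that assumes a `tidy` executable is on the PATH. `-q` (quiet), `-asxhtml` (write XHTML, which uses lowercase tag and attribute names) and `-m` (modify the file in place) are standard Tidy options, while `build_tidy_command` and `tidy_file` are names invented for this example.

```python
import subprocess


def build_tidy_command(path: str) -> list:
    """Command line that converts one file to XHTML in place."""
    return ["tidy", "-q", "-asxhtml", "-m", path]


def tidy_file(path: str) -> bool:
    """Run Tidy on one file; True if it finished without errors."""
    result = subprocess.run(build_tidy_command(path))
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    return result.returncode in (0, 1)
```

Because `-m` overwrites the input, it is worth keeping a copy of the originals before running this over a folder of files.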

hgondalf

« Reply #2 on: September 03, 2012, 08:31:27 AM »
Thanks for the tip. Unfortunately, Tidy has no executable for Windows 7 and is no longer supported. One can check out the source code, if one happens to have a compiler handy, I guess.

Komodo IDE (hundreds of bucks) has one, and also there are macros for Komodo Edit 7 (free). 

If I get around to writing this app I'll post it here.

MilesAhead

« Reply #3 on: September 03, 2012, 12:24:29 PM »
If you have lots of input files then stream editing may be the way to go. I suggest you check out the sed stream editor available, with a lot of other Linux tools, to run on Windows:

http://gnuwin32.sour...ge.net/packages.html

Also you could search for Tidy. There are lots of variants around. Command line tools for Windows often aren't updated after XP, but the command line approach doesn't really change unless they are doing something via PowerShell. Reading standard input and writing standard output has been around for a long time.

Edit: searching for "sed lower case html tags" comes up with this sed script:

http://sed.sourcefor.../scripts/html_lc.sed

There's more of a command line filter heritage in Linux than Windows. If it's some common task reading input, filtering, and sending it to the output, chances are it's already been done in one or more of the Linux tools like nawk, sed, perl.

« Last Edit: September 03, 2012, 04:25:05 PM by MilesAhead »

kfitting

  • Charter Member
  • Joined in 2005
  • Posts: 578
« Reply #4 on: September 03, 2012, 01:12:45 PM »
Don't know if this is too much work, but at least for completeness, you could look at HTML Tidy:
http://tidy.sourceforge.net/

Look along the right side for some apps that use it... there is one [slightly out of date] GUI as well.

MilesAhead

« Reply #5 on: September 03, 2012, 03:01:52 PM »
Another approach may be to go on AutoHotkey forum and request a regex that will do the conversion. Regular expressions are not my forte. :)
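
As an illustration of what such a regex could look like, here is a sketch in Python rather than AutoHotkey. The two patterns and the names `TAG`, `PART` and `lowercase_tags` are assumptions made for this example, not something posted in the thread. One pattern finds each tag, the other lowercases everything inside it except quoted attribute values.

```python
import re

# One tag: "<", optional "/", a letter (so "<!" markup is skipped), then
# quoted strings or any other characters up to the closing ">".
TAG = re.compile(r"""</?[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>""")
# Inside a tag: a quoted value (group 1, kept as-is) or a run of other text.
PART = re.compile(r"""("[^"]*"|'[^']*')|[^"']+""")


def _lower_tag(match):
    """Lowercase one matched tag, keeping its quoted values unchanged."""
    return PART.sub(lambda m: m.group(1) or m.group(0).lower(), match.group(0))


def lowercase_tags(html):
    """Lowercase tag and attribute names throughout an HTML string."""
    return TAG.sub(_lower_tag, html)


print(lowercase_tags('<P ALIGN="Center">Text<BR/></P>'))  # <p align="Center">Text<br/></p>
```

Regular expressions do not parse HTML in general, so this is only reasonable for regular, machine-generated markup such as word-processor exports.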

hgondalf

« Reply #6 on: September 03, 2012, 09:55:59 PM »
Thanks to all for your help. I finally navigated the Tidy page at SourceForge fully and found the GUI version, which met my needs.

Kindly mark this issue SOLVED.