Author Topic: DONE: HTML Garbage Tag Removal  (Read 7086 times)

pulphero

  • Supporting Member
  • Joined in 2007
  • Posts: 24
DONE: HTML Garbage Tag Removal
« on: June 03, 2017, 09:22 AM »
In constructing ebooks, I often run into these unnecessary HTML tag pairs in files exported from InDesign:

<span class="no-break">I don</span>’t

or:

<span class="no-break">my decision.</span>

It always follows the pattern of <span class="no-break">, followed by some arbitrary amount of text, and then a closing </span>. Deleting these tags (and there are lots of them!) manually in Sublime Text is a huge time sink.

I'd like to have an AHK script or Regex code that will delete these specific tags but not the text between them, leaving, for example:

I don't
my decision.

Although it seems a simple problem, I've not been able to come up with anything that works. Thanks!

MilesAhead

  • Supporting Member
  • Joined in 2009
  • Posts: 7,736
Re: HTML Garbage Tag Removal
« Reply #1 on: June 03, 2017, 10:00 AM »
How big are the files?  Perhaps this web applet is good enough?

http://www.striphtml.com

Edit: Hmm, it seems to do nothing. Perhaps it only works if there are HTML header or body tags. I don't know.

Also, I got this from the "sed one-liners" site:

 # remove most HTML tags (accommodates multiple-line tags)
 sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

sed is a very powerful free stream editor.  It can do many things way faster than an interactive edit session.  There are free versions for Windows:
http://gnuwin32.sour...net/packages/sed.htm

The page of "one-liner" sed scripts:
http://sed.sourceforge.net/sed1line.txt


The idea is that the file to be modified is usually fed into sed via command-line redirection, with the output redirected to a new file. It processes the whole file in one pass.
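
For anyone more comfortable in Python than sed, here is a rough sketch of the same "stream in, stream out" idea. It is not a port of the sed script, just the same pattern (anything from "<" to the next ">") applied to the whole input at once. The function names and the sample line are made up for illustration.

```python
import io
import re

# Anything from "<" up to the next ">". Because [^>] also matches
# newlines, a tag that is split across several lines is matched too.
TAG = re.compile(r"<[^>]*>")


def strip_all_tags(text):
    """Remove every tag and keep only the text between them."""
    return TAG.sub("", text)


def filter_stream(src, dst):
    """Read all of src, strip the tags, and write the result to dst."""
    dst.write(strip_all_tags(src.read()))


# Demo with in-memory streams standing in for the redirected files.
demo_in = io.StringIO('<p><span class="no-break">my decision.</span></p>\n')
demo_out = io.StringIO()
filter_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")  # the <p> tags are stripped as well
```

To use it the way sed is used, you could call filter_stream(sys.stdin, sys.stdout) and redirect the input and output files on the command line. Note that it removes every tag, not just one specific pair.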

« Last Edit: June 03, 2017, 10:08 AM by MilesAhead »

pulphero

Re: HTML Garbage Tag Removal
« Reply #2 on: June 03, 2017, 10:11 AM »
StripHTML is a great app, but I believe it takes out ALL the HTML tags. I need only a specific opening/closing tag removed.

The files aren't that big. Most are a few hundred lines of code. For each book, there are usually two or three dozen files.

I'm really just looking for a dirt-simple AHK script or regex code. I doubt I'm savvy enough to handle something like sed.



MilesAhead

Re: HTML Garbage Tag Removal
« Reply #3 on: June 03, 2017, 10:21 AM »
Why not grab one of those freeware "regex tester" programs? You put in sample text and a regex, hit the Go button, and it shows the results. I get most of my regexes by trial and error myself. I don't use them often enough to predict what will happen.

Here's one from sourceforge but there are a bunch of them out there:
https://sourceforge....rojects/regextester/

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: HTML Garbage Tag Removal
« Reply #4 on: June 03, 2017, 10:40 AM »
Well,

I tried with a trial version of Sublime Text (v3 build 3126 x64) and cooked up this regex to remove the extra span tags:
Find What: (?m)<span\sclass="no-break">(.*?)<\/span>(.*?)
Replace With: \1\2
Be sure to enable the Regular expression option (Alt-R from the Replace screen). Switching on Wrap is required too.
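
Since there are two or three dozen files per book, the same find/replace could also be run over a whole folder from a script. Below is a minimal Python sketch of that, not a tested production tool. The folder argument, the "*.html" file pattern, and the function names are assumptions for illustration. The trailing (.*?) from the Sublime pattern is left out because a lazy group at the very end of a pattern always matches nothing.

```python
import re
from pathlib import Path

# The no-break span pattern from the Sublime Text search above.
# The lazy (.*?) stops at the first closing </span> on the same line.
NO_BREAK = re.compile(r'<span\sclass="no-break">(.*?)</span>')


def remove_no_break_spans(html):
    """Drop the no-break span tags but keep the text between them."""
    return NO_BREAK.sub(r"\1", html)


def clean_folder(folder, pattern="*.html"):
    """Rewrite matching files in place; return how many files changed.

    Both the folder and the file pattern are placeholders. Work on a
    copy of the book files, since the originals are overwritten.
    """
    changed = 0
    for path in Path(folder).glob(pattern):
        original = path.read_text(encoding="utf-8")
        cleaned = remove_no_break_spans(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed


print(remove_no_break_spans('<span class="no-break">my decision.</span>'))
# prints: my decision.
```

Like the Sublime version, this assumes no other span is nested inside a no-break span. If one is, the match ends at the inner </span> and leaves an unbalanced tag behind.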

pulphero

Re: HTML Garbage Tag Removal
« Reply #5 on: June 03, 2017, 10:58 AM »
Wow! It works perfectly. Thanks so much for doing this.

Ath

Re: HTML Garbage Tag Removal
« Reply #6 on: June 03, 2017, 12:16 PM »
And on special request: to replace <span class="i">one</span> with <i>one</i>, you could use this replacement:
Find What: (?m)<span\sclass="i">(.*?)<\/span>
Replace With: <i>\1</i>
The disadvantage here is that after this replacement you won't be able to use the "i" class in CSS to restyle all of that text (as emphasized, a whole different font, or whatever) with a single CSS modification anymore. But I'm not sure that would be useful for an ebook ;)
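
For comparison, the same replacement as a small Python sketch. The function name is made up, and the pattern is the one above, with \1 carrying the captured text into the new <i> pair.

```python
import re

# Capture the text inside <span class="i">...</span> so it can be re-wrapped.
ITALIC_SPAN = re.compile(r'<span\sclass="i">(.*?)</span>')


def spans_to_i(html):
    """Turn each <span class="i">text</span> into <i>text</i>."""
    return ITALIC_SPAN.sub(r"<i>\1</i>", html)


print(spans_to_i('<span class="i">one</span>'))
# prints: <i>one</i>
```

Because the group is lazy, two such spans on the same line are converted separately rather than being swallowed as one long match.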

skwire

  • Global Moderator
  • Joined in 2005
  • Posts: 5,286
Re: HTML Garbage Tag Removal
« Reply #7 on: June 03, 2017, 12:47 PM »
Nice work, Ath.  I'm going to mark this as done.  Cheers.