Author Topic: DONE: HTML Garbage Tag Removal (Read 8737 times)

pulphero · « **on:** June 03, 2017, 09:22 AM »

In constructing ebooks, I often run into these unnecessary HTML tag pairs in files exported from InDesign:

<span class="no-break">I don’t</span>

or:

<span class="no-break">my decision.</span>

It always follows the pattern of <span class="no-break"> followed by some random amount of text and then a closing </span>. Deleting these tags (and there are lots of them!) manually in SublimeText is a huge time sink.

I'd like to have an AHK script or Regex code that will delete these specific tags but not the text between them, leaving, for example:

I don't
my decision.

Although it seems a simple problem, I've not been able to come up with anything that works. Thanks!

MilesAhead · « **Reply #1 on:** June 03, 2017, 10:00 AM »

How big are the files? Perhaps this web applet is good enough?

http://www.striphtml.com

Edit: Hmm, seems to do nothing. Perhaps it only works if there are html header or body tags. I don't know.

Also, I got this from the "sed one-liners" site:

# remove most HTML tags (accommodates multiple-line tags)
sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

sed is a very powerful free stream editor. It can do many things way faster than an interactive edit session. There are free versions for Windows:
http://gnuwin32.sour...net/packages/sed.htm

The page of "one-liner" sed scripts:
http://sed.sourceforge.net/sed1line.txt

The idea is that the file to be modified is usually fed into sed via command-line redirection, with the output redirected to a new file. That way the whole file is modified in one shot.
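For anyone who would rather not install sed, here is a minimal Python sketch of roughly the same idea as the one-liner above: strip every tag, including tags broken across lines, and write the result to a new file. The names `strip_all_tags` and `strip_file` are made up for illustration, and UTF-8 input is an assumption. Note that this removes all tags, not just one kind.

```python
import re
from pathlib import Path

# "<", then anything that is not ">", then ">". A negated class also matches
# newlines, so a tag broken across lines is caught without sed's N loop.
ANY_TAG = re.compile(r"<[^>]*>")

def strip_all_tags(html: str) -> str:
    """Remove every tag and keep only the text between tags."""
    return ANY_TAG.sub("", html)

def strip_file(src: str, dst: str) -> None:
    """Read src, strip its tags, write the result to dst (src is untouched)."""
    text = Path(src).read_text(encoding="utf-8")  # assumes UTF-8 files
    Path(dst).write_text(strip_all_tags(text), encoding="utf-8")

sample = '<p>It was <span\nclass="no-break">my decision.</span></p>'
print(strip_all_tags(sample))  # It was my decision.
```

Writing to a separate output file mirrors the redirection approach: the original stays intact if the result is not what you wanted.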

pulphero · « **Reply #2 on:** June 03, 2017, 10:11 AM »

StripHTML is a great app, but I believe it takes out ALL the HTML tags. I need only a specific opening/closing tag removed.

The files aren't that big. Most are a few hundred lines of code. For each book, there are usually two or three dozen files.

I'm really just looking for a dirt-simple AHK script or regex code. Doubt I'm savvy enough to handle something like SED.

MilesAhead · « **Reply #3 on:** June 03, 2017, 10:21 AM »

Why not grab one of those freeware "regex tester" programs? You put in sample text, and a regex. Hit the Go Button and it shows the results. Most regex I get by trial and error myself. I don't use it often enough to predict what will happen.

Here's one from sourceforge but there are a bunch of them out there:
https://sourceforge....rojects/regextester/

Ath · « **Reply #4 on:** June 03, 2017, 10:40 AM »

Well,

I tried it with a trial version of Sublime Text (v3 build 3126 x64) and cooked up this regex to remove the extra span tags:

Find What: (?m)<span\sclass="no-break">(.*?)<\/span>(.*?)
Replace With: \1\2

Be sure to enable the Regular expression option (Alt-R from the Replace screen). Switching on Wrap is required too.
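Since each book has two or three dozen files, the same find/replace could also be run over a whole folder in one go. This is only a sketch in Python, not part of the Sublime Text recipe: the function names, the `*.html` glob, and UTF-8 encoding are assumptions. It drops the trailing `(.*?)` group, which always matches empty text here, and it assumes no other span is nested inside a no-break span (the lazy match would stop at the inner closing tag).

```python
import re
from pathlib import Path

# Same pattern as the Find What field, minus the empty trailing group.
NO_BREAK = re.compile(r'<span\sclass="no-break">(.*?)</span>')

def unwrap_no_break(html: str) -> str:
    """Drop the no-break span tags but keep the text between them."""
    return NO_BREAK.sub(r"\1", html)

def clean_folder(folder: str, pattern: str = "*.html") -> int:
    """Rewrite every matching file in place; return how many files changed."""
    changed = 0
    for path in Path(folder).glob(pattern):
        original = path.read_text(encoding="utf-8")  # assumes UTF-8 files
        cleaned = unwrap_no_break(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Something like `clean_folder("my-book")` (a hypothetical folder name) would then process one book. Because the files are rewritten in place, work on a copy of the folder first.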

pulphero · « **Reply #5 on:** June 03, 2017, 10:58 AM »

Wow! It works perfectly. Thanks so much for doing this.

Ath · « **Reply #6 on:** June 03, 2017, 12:16 PM »

And on special request, replacing one by one you could use this replacement:

Find What: (?m)<span\sclass="i">(.*?)<\/span>
Replace With: \1

The disadvantage here is that after this replacement you can no longer restyle all of your emphasized text globally (a whole different font or whatever) with a single CSS modification. But I'm not sure that would be useful for an ebook.
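The two patterns in this thread differ only in the class name, so the idea can be sketched once with the class as a parameter. This is a hypothetical Python helper, not something from the thread, and it takes the Replace With value exactly as printed above (keep the inner text, drop the span). It assumes the span has exactly one attribute, written as `class="..."`, and no nested spans.

```python
import re

def unwrap_span(html: str, css_class: str) -> str:
    """Remove <span class="css_class"> ... </span> pairs, keeping the inner text."""
    # re.escape guards class names that contain regex metacharacters.
    pattern = re.compile(
        r'<span\sclass="' + re.escape(css_class) + r'">(.*?)</span>'
    )
    return pattern.sub(r"\1", html)

sample = 'An <span class="i">odd</span>, <span class="no-break">long day</span>'
print(unwrap_span(sample, "i"))
```

Only spans with the requested class are touched, so any class you still want to style from CSS can be left alone.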

skwire · « **Reply #7 on:** June 03, 2017, 12:47 PM »

Nice work, Ath. I'm going to mark this as done. Cheers.
