DonationCoder.com Software > Finished Programs

DONE: HTML Garbage Tag Removal

pulphero:
In constructing ebooks, I often run into these unnecessary HTML tag pairs in files exported from InDesign:

<span class="no-break">I don</span>’t

or:

<span class="no-break">my decision.</span>

It always follows the pattern of <span class="no-break"> followed by an arbitrary amount of text and then a closing </span>. Deleting these tags (and there are lots of them!) manually in SublimeText is a huge time sink.

I'd like to have an AHK script or Regex code that will delete these specific tags but not the text between them, leaving, for example:

I don't
my decision.

Although it seems a simple problem, I've not been able to come up with anything that works. Thanks!

MilesAhead:
How big are the files?  Perhaps this web applet is good enough?

http://www.striphtml.com

Edit:  Hmm, seems to do nothing.  Perhaps it only works if there are html header or body tags.  I don't know.

Also, I got this from the "sed one-liners" site:

 # remove most HTML tags (accommodates multiple-line tags)
 sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

sed is a very powerful free stream editor.  It can do many things way faster than an interactive edit session.  There are free versions for Windows:
http://gnuwin32.sourceforge.net/packages/sed.htm

The page of "one-liner" sed scripts:
http://sed.sourceforge.net/sed1line.txt


The idea is that the file to be modified is fed into sed, usually via command-line redirection, and the output is redirected to a new file.  It modifies the whole file in one shot.
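For anyone who would rather not install sed, here is a rough Python sketch of the same idea: read a file, delete everything that looks like a tag, and write the result to a new file. It is not a port of the one-liner, just the same regex applied with Python's re module. The function names and file paths are made up for illustration. Note that, like the sed version, it removes every tag, not only one kind.

```python
import re
from pathlib import Path

# Matches any tag: "<", then anything that is not ">", then ">".
# [^>] also matches newlines, so tags split across lines are covered.
ANY_TAG = re.compile(r"<[^>]*>")

def strip_all_tags(html):
    """Remove every HTML tag and keep only the text between them."""
    return ANY_TAG.sub("", html)

def strip_file(src, dst):
    """Read src, strip all tags, write the result to dst (src is untouched).

    Both paths are hypothetical; pass whatever files you actually have.
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(strip_all_tags(text), encoding="utf-8")

sample = '<p><span class="no-break">my decision.</span> It was <em>final</em>.</p>'
print(strip_all_tags(sample))  # my decision. It was final.
```

The sample shows the catch: the <p> and <em> tags disappear along with the span, so this is only useful when you want plain text out.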

pulphero:
StripHTML is a great app, but I believe it takes out ALL the HTML tags. I need only a specific opening/closing tag removed.

The files aren't that big. Most are a few hundred lines of code. For each book, there are usually two or three dozen files.

I'm really just looking for a dirt-simple AHK script or regex code. Doubt I'm savvy enough to handle something like SED.


MilesAhead:
Quote from pulphero (June 03, 2017, 10:11 AM):
I'm really just looking for a dirt-simple AHK script or regex code. Doubt I'm savvy enough to handle something like SED.
--- End quote ---

Why not grab one of those freeware "regex tester" programs?  You put in sample text and a regex, hit the Go button, and it shows the results.  I work out most of my regexes by trial and error myself.  I don't use them often enough to predict what will happen.

Here's one from sourceforge but there are a bunch of them out there:
https://sourceforge.net/projects/regextester/
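If you have Python around, the same trial-and-error loop can be sketched in a few lines: give it a pattern and some sample text, and it lists what the pattern matched. The helper name try_regex and the sample string are only illustrations, not part of any tester program.

```python
import re

def try_regex(pattern, sample):
    """Return every substring of `sample` that `pattern` matches."""
    return [m.group(0) for m in re.finditer(pattern, sample)]

sample = '<span class="no-break">I don</span>’t know, <em>really</em>.'

# First guess: greedy, so one match runs from the first "<" to the last ">".
print(try_regex(r"<.*>", sample))

# Narrower guess: only the opening no-break span and the closing span tag.
print(try_regex(r'<span class="no-break">|</span>', sample))
```

Seeing the first guess swallow the whole line is usually enough of a hint to tighten the pattern.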

Ath:
Well,

I tried it with a trial version of Sublime Text (v3 build 3126 x64) and cooked up this regex to remove the extra span tags:

Find What: (?m)<span\sclass="no-break">(.*?)<\/span>(.*?)
Replace With: \1\2

Be sure to enable the Regular expression option (Alt-R from the Replace screen). Switching on Wrap is required too.
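Since each book has a few dozen files, the same find-and-replace could also be run over a whole folder in one go. Below is a minimal Python sketch of that, assuming the span never contains another span and never runs across a line break. The function names, the folder argument, and the *.html glob are hypothetical. The pattern drops the trailing (.*?) group, which can only match an empty string at the end of the regex, so the result is the same.

```python
import re
from pathlib import Path

# Same idea as the Find What / Replace With pair above. Group 1 is the
# text between the tags; the tags themselves are discarded.
NO_BREAK = re.compile(r'<span\sclass="no-break">(.*?)</span>')

def unwrap_no_break(html):
    """Drop the no-break span tags but keep the text they wrapped."""
    return NO_BREAK.sub(r"\1", html)

def clean_folder(folder, pattern="*.html"):
    """Rewrite each matching file in place; return how many files changed.

    The folder layout and the *.html glob are assumptions for illustration.
    Work on a copy of the files first.
    """
    changed = 0
    for path in Path(folder).glob(pattern):
        original = path.read_text(encoding="utf-8")
        cleaned = unwrap_no_break(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed

print(unwrap_no_break('<span class="no-break">I don</span>’t'))
# I don’t
print(unwrap_no_break('<p><span class="no-break">my decision.</span></p>'))
# <p>my decision.</p>
```

Other tags, including spans with a different class, are left alone because the pattern only matches the exact class="no-break" opening tag.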
