DonationCoder.com Software > Finished Programs
DONE: HTML Garbage Tag Removal
pulphero:
In constructing ebooks, I often run into these unnecessary HTML tag pairs in files exported from InDesign:
<span class="no-break">I don</span>’t
or:
<span class="no-break">my decision.</span>
It always follows the pattern of <span class="no-break">, followed by some arbitrary amount of text and then a closing </span>. Deleting these tags (and there are lots of them!) manually in Sublime Text is a huge time sink.
I'd like to have an AHK script or Regex code that will delete these specific tags but not the text between them, leaving, for example:
I don't
my decision.
Although it seems a simple problem, I've not been able to come up with anything that works. Thanks!
MilesAhead:
How big are the files? Perhaps this web applet is good enough?
http://www.striphtml.com
Edit: Hmm, seems to do nothing. Perhaps it only works if there are html header or body tags. I don't know.
Also, I got this from the "sed one-liners" site:
# remove most HTML tags (accommodates multiple-line tags)
sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
sed is a very powerful free stream editor. It can do many things way faster than an interactive edit session. There are free versions for Windows:
http://gnuwin32.sourceforge.net/packages/sed.htm
The page of "one-liner" sed scripts:
http://sed.sourceforge.net/sed1line.txt
The idea is that the file to be modified is usually fed into sed via command-line redirection, and the output is redirected to a new file. The whole file is processed in one shot.
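For anyone without sed installed, here is a rough Python sketch of the same idea. It reuses the `<[^>]*>` pattern from the one-liner above; the names `strip_all_tags` and `filter_stream` are made up for illustration. Like the sed version, it removes every tag it finds, not just one particular kind.

```python
import re
import sys

# Same pattern as the sed substitution. [^>] also matches newlines,
# so a tag that is split across several lines is removed as well.
TAG = re.compile(r"<[^>]*>")


def strip_all_tags(text):
    """Return text with every <...> tag removed."""
    return TAG.sub("", text)


def filter_stream(src=sys.stdin, dst=sys.stdout):
    """Read all of src, strip the tags, and write the result to dst."""
    dst.write(strip_all_tags(src.read()))


# To use it as a filter, call filter_stream() at the end of the script
# and run it with redirection (file names here are placeholders):
#   python strip_tags.py < input.html > output.txt
```

Because the input is read in one piece and written to a separate output, the original file is left untouched, which matches the redirection workflow described above.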
pulphero:
StripHTML is a great app, but I believe it takes out ALL the HTML tags. I need only a specific opening/closing tag removed.
The files aren't that big. Most are a few hundred lines of code. For each book, there are usually two or three dozen files.
I'm really just looking for a dirt-simple AHK script or regex code. Doubt I'm savvy enough to handle something like SED.
MilesAhead:
Quoting pulphero (June 03, 2017, 10:11 AM): "I'm really just looking for a dirt-simple AHK script or regex code."
Why not grab one of those freeware "regex tester" programs? You put in sample text and a regex, hit the Go button, and it shows the results. I work out most of my regexes by trial and error myself; I don't use them often enough to predict what will happen.
Here's one from SourceForge, but there are a bunch of them out there:
https://sourceforge.net/projects/regextester/
Ath:
Well, I tried this with a trial version of Sublime Text (v3 build 3126 x64) and cooked up this regex to remove the extra span tags:
Find What: (?m)<span\sclass="no-break">(.*?)<\/span>(.*?)
Replace With: \1\2
Be sure to enable the Regular expression option (Alt-R from the Replace screen). Switching on Wrap is required too.
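Since each book has two or three dozen files, the same replacement could also be run over a whole folder at once. The following is only a sketch in Python, not something tested in this thread. It uses the pattern above minus the `(?m)` flag and the trailing `(.*?)` group (neither changes what gets matched, because the pattern has no `^` or `$` and a lazy group at the end always matches nothing). The function names and the `*.xhtml` glob are assumptions; adjust them to the actual export. It also assumes no other span is nested inside a no-break span and that each span sits on one line, as in the samples.

```python
import re
from pathlib import Path

# Matches only spans whose opening tag is exactly <span class="no-break">;
# the text between the tags is captured so it can be kept.
NO_BREAK = re.compile(r'<span\sclass="no-break">(.*?)</span>')


def remove_no_break_spans(html):
    """Drop the no-break span tags but keep the text they wrap."""
    return NO_BREAK.sub(r"\1", html)


def clean_folder(folder, pattern="*.xhtml"):
    """Rewrite every matching file in folder; return how many changed.

    The glob pattern is a guess at the file extension. Run this on a
    copy of the book folder, since files are overwritten in place.
    """
    changed = 0
    for path in Path(folder).glob(pattern):
        # newline="" keeps the file's original line endings intact.
        with path.open(encoding="utf-8", newline="") as f:
            original = f.read()
        cleaned = remove_no_break_spans(original)
        if cleaned != original:
            with path.open("w", encoding="utf-8", newline="") as f:
                f.write(cleaned)
            changed += 1
    return changed
```

For example, `remove_no_break_spans('<span class="no-break">my decision.</span>')` returns `my decision.`, while a span with any other class is left alone.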