Author Topic: Dottie and the Backpacker Killer  (Read 2070 times)

ital2

  • Member
  • Joined in 2017
  • Posts: 115
Dottie and the Backpacker Killer
« on: July 21, 2017, 05:50 AM »
Let's put it simply: whenever some lunatic encounters some backpacker, trekking along all alone on her trail, or even accompanied by her boyfriend or some other chick, he's ready to kill (I even feared for Sue - and that was in a film! - when Crocodile* left her alone in the outback for just a few minutes!); whenever you encounter a dottie in some regex of yours, be it accompanied by a star, a plus sign or anything else, get ready to kill, too!

*: Btw, they met on the set and have been happy together for 31 years now, that's what I call love! Cheerio!

And here's the mid-length version: today, more and more applications that offer search (PIMs, to name just one example; programming languages anyway) also offer the alternative of a regex search, and that's a very good thing - but it's a fact that many people who use regex, even on a regular basis, do so without the knowledge necessary to use it smartly.

It's common knowledge that regex is "slow"; as far as I'm concerned, I've always been wary of the dot indeed, but without really knowing why and without really optimizing (grouping yes, but not enough excluding yet); in fact, I've always been aware of the fact that even the (regular) "greedy" quantifiers give correct results (regularly, not always) in cases where, by my feeling, only the "lazy" ones (available via an additional "?", or via a leading option code for the whole expression) should do.
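To make the greedy/lazy difference concrete, here's a minimal Python sketch (the sample text and patterns are my own illustration, not from any of the links below):

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* first swallows everything to the end of the string, then
# backtracks until a final '>' can match - one huge match.
greedy = re.findall(r'<.*>', text)   # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? starts empty and expands only as far as needed - each tag
# is matched separately.
lazy = re.findall(r'<.*?>', text)    # ['<b>', '</b>', '<i>', '</i>']
```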

The solution to this apparent contradiction lies in automatic backtracking - well explained in the final link below - and now you easily understand why regex, doing innumerable backtracks in order to match the text against a (bad) search expression, can be very slow indeed. Of course, when we search for something, we just want the search to work; we don't want to understand how the search is really done - but that's exactly what we MUST know first in order to be able to write effective (or even just "correct") regex searches, and that unfortunately means, as a rule with few exceptions: the better you want your search to be, the more inscrutable (for a beginner) your search expression will become.
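You can watch this backtracking at work in a tiny Python example of my own: the greedy dot-star first consumes the whole line and is then forced to give characters back.

```python
import re

line = 'a=1; b=2; c=3; trailing'

# .* consumes the entire line first; the engine then backtracks,
# character by character, until a ';' can be matched - which is why a
# greedy '.*;' always matches up to the LAST semicolon.
m = re.match(r'.*;', line)
# m.group() == 'a=1; b=2; c=3;'
```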

It's evident that if you need such a search expression for some single task, you'll do it quick'n'dirty and hope for the best, comparing the results with what you expected; if these match, you're done with it - look out, though, for both false positives and misses, both of which are so easy to get with an approximative regex. Whenever you need a search expression to do its work on something again and again, you had better make it as specific for that kind of text as you can, in order to make it fast and reliable.
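As a sketch of that trade-off (the date format and both patterns are an assumed example of mine): a quick'n'dirty pattern happily produces false positives which a text-specific pattern rejects.

```python
import re

# Quick'n'dirty: matches anything that is merely date-shaped.
loose = re.compile(r'\d+/\d+/\d+')

# Specific to an assumed dd/mm/yyyy convention: day 01-31, month 01-12.
strict = re.compile(r'\b(?:0[1-9]|[12]\d|3[01])/(?:0[1-9]|1[0-2])/\d{4}\b')

sample = 'paid 99/99/9999, arrived 21/07/2017'
false_positive = loose.search('99/99/9999')  # matches - a false positive
real_dates = strict.findall(sample)          # only '21/07/2017'
```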

So we encounter a new problem here: the more specific your regex, the higher the probability that it will not match given texts. There are two ways (which I currently see; there may be additional ones) to counter this problem. First, have preliminary scripts (with or without regexes, probably including some) check your target texts first, and hopefully alert you whenever your main script/regex will not be successful (again, false positives and/or misses) for compatibility reasons - you cannot anticipate every possible issue this way, but you can indeed exclude most of the "usual suspects".
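Such a preliminary check could look like this minimal Python sketch (the expected "key: value" line shape is a hypothetical example of mine): it flags the lines the main regex would probably mishandle, instead of letting them fail silently.

```python
import re

# Assumed shape every input line should have before the main regex runs.
expected_shape = re.compile(r'^\w+: .+$')

def precheck(lines):
    """Return the numbers of non-empty lines that break the expected shape."""
    return [n for n, line in enumerate(lines, 1)
            if line and not expected_shape.match(line)]

suspects = precheck(['host: example.org', 'garbage', 'port: 80'])
# suspects == [2] - alert before running the main script on this text
```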

Second, prefer "pure scripting" over regex whenever reasonable; your regex will be enclosed in some script anyway, so do multiple things on the macro level with that language's commands - searches, cutting up (and checking) text parts (process logic) - and keep regex expressions mostly for the further micro processing, i.e. for analysis/retrieval within previously segregated parts of the individual text. (And let me repeat here that you avoid many a problem by programmatically checking the results of any regex match (or non-match), on the micro as well as on the macro level.)
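A minimal Python sketch of that hybrid approach (the INI-like input is an assumed example): plain string operations do the macro-level cutting-up, a small and specific regex does the micro-level retrieval, and every non-match is checked rather than silently dropped.

```python
import re

text = 'name = Dottie\nage = 31\nbroken line\n'

# Micro-level regex, kept small and specific to one already-isolated line.
pair = re.compile(r'^(\w+)\s*=\s*(.+)$')

result, rejected = {}, []
for line in text.splitlines():          # macro level: plain splitting
    m = pair.match(line)                # micro level: specific regex
    if m:
        result[m.group(1)] = m.group(2)
    else:
        rejected.append(line)           # checked, not silently ignored
```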

This way, it also becomes much easier to target your regex expressions properly, i.e. to write them in the most specific way possible.

From the links which follow you will learn that even lazy quantifiers will not save your regex expressions from becoming too broad - which will not necessarily make them faulty, but which will render them, often quite incredibly, expensive (i.e. slow), by the way regex executes the search on the technical level. Group, and be as avaricious as you can possibly be with what you grant any such group, instead of counting on regex's ability to clean up, from behind, the mess you threw at it, at the cost that comes with that unnecessary cleaning-up: avoid the backtracking; avoid a backpack weighing a ton, to be cleared gram by gram by regex for you while you (your script/search) should be running.
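Being "avaricious" with each group mostly means replacing the dot with a negated character class, so each group can only ever take what belongs to it and nothing needs to be cleared up afterwards; a Python sketch with an assumed comma-separated line of mine:

```python
import re

row = 'a,b,c,' + 'x' * 50 + ',end'

# Broad: even lazy dots make the engine expand each group step by step.
broad = re.compile(r'(.*?),(.*?),(.*?),(.*)')

# Avaricious: [^,]* can never run past a comma, so there is nothing
# for the engine to backtrack over.
specific = re.compile(r'([^,]*),([^,]*),([^,]*),(.*)')

same_result = broad.match(row).groups() == specific.match(row).groups()
# same_result is True: same answer, but the specific pattern gets it
# without any backtracking.
```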

(I've got several, now-optimized, regexes which run for several seconds on an i7-7700 with plenty of memory; now imagine those searches running, non-optimized, on some i3, let alone on my old laptop - so our point (see the links) is not trivial.)

As for the long version - the intent of this post being only to create awareness, I'll leave the explaining to the experts - please refer to the following links:

https://blog.mariuss...at-you-actually-want

http://www.regular-e...ons.info/repeat.html

http://www.regular-e...ons.info/atomic.html

And finally this one, which explains in depth why and how backtracking can even go completely wild:
http://www.regular-e...fo/catastrophic.html
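The classic textbook example of such runaway backtracking is a nested quantifier like (a+)+ forced to fail; this little Python timing sketch (my own, not taken from the link) shows the blow-up, and the trivially equivalent linear pattern:

```python
import re
import time

evil = re.compile(r'(a+)+$')   # nested quantifiers: catastrophic on failure
safe = re.compile(r'a+$')      # matches the same strings, linearly

subject = 'a' * 20 + 'b'       # the trailing 'b' forces every attempt to fail

t0 = time.perf_counter(); evil.match(subject); slow = time.perf_counter() - t0
t0 = time.perf_counter(); safe.match(subject); fast = time.perf_counter() - t0
# 'slow' grows exponentially with the number of a's; 'fast' stays tiny.
# Add a few more a's and the evil pattern will seem to hang forever.
```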

There are two absolutely authoritative standard books on regex - Friedl: Mastering Regular Expressions, and Goyvaerts: Regular Expressions Cookbook. I own both, but I admit I never had the courage to really read them; up to now, I've only used them for looking things up. A quick search for "Regular Expressions" on amazon.com will show you that there are some introductory books on the subject, too, from several other authors, which may be a little more "accessible" - or then, on some rainy Sunday, should I have a good look again into the writings I already own?

Goyvaerts is the author of the site regular-expressions.info (the links above) and of several highly sophisticated regex tools. I never saw the need to buy his PowerGREP - it was 119€, is now 139€ plus VAT - since I like to write my code myself, so, for example, (the independent) TextPipe Pro (395$ plus VAT on datamystic.com, though here and there available on bitsdujour for a very reasonable price) didn't seriously tempt me either. But this latter tool seems to be more in line with what I've said above about scripting-plus-regex, since behind the scenes it seems to use exactly that hybrid approach, and in a far broader way than PowerGREP (i.e. within the files, not only for the switching between files) - I could be mistaken here; I'm just speaking from their sites' descriptions. On the other hand, when you're a regex expert to the degree Goyvaerts (and Friedl) indubitably are, you can permit yourself to do more in and by regex, since writing (correct! fast!) regex expressions will only take you a fraction of the time it would cost us.

This being said, from his examples in the very last link above, it becomes evident that Goyvaerts' RegexBuddy (30€ plus VAT, https://www.regexbuddy.com/ ) is a highly valuable tool for anyone writing their own regexes, so I'm finally going to buy that one - obviously I should have done so before. (You know the lines along which I write, so you know this isn't meant as an advertisement for a product, but as a service to the reader.)