Topic: RegEx Vicinity Search / Regular Expressions: Search string a Near string b

ital2

or How to write regular expressions without losing your mind

The following AHK script line reads a file into the variable tb. (You could search the text file directly instead, but a variable is much faster. tb stands for the "text body" to be searched; for simple searches you could replace tb by the clipboard. We don't use AHK's clipboard variable here since we need it further on, in a script (with an altered regex search) which will not search just once but will find all occurrences and build a results table for them.)

fileread, tb, C:\AHK\regexvicinitysearch.txt ; or your respective full file path (; plus text are comments)
Don't mind the mix of variable and (path) string here, both without "": that's AHK's crazy command syntax (as opposed to expression syntax). Some commands are only available in command form, others in both (as of 2017; they've been working on this for years now...). Just always write it exactly as spelled out here.

We begin with a simple regex search; in fact, it's so simple that you would use a regular search instead, but it's for clarity reasons here:
pos := regexmatch(tb, "youronlysearchstring")
pos receives the return value of the function: it is either 0 or, in case of success, the "character position" of the very first character of the found (complete) searchstring.
Additionally, you can also retrieve the found searchstring (which is devoid of sense of course if it doesn't contain any unknown-yet elements, as here):
pos := regexmatch(tb, "youronlysearchstring", result)

Then, you will like to see the output, ie position of the find (if any), and the full found string:
msgbox, %pos%`n%result%
This will show you the position-in-text number of the first character of the search string (0 if not found) and then, in a second line (and if found), the full search string, ie in this primitive example, just your searchstring which you know already, but later on, it'll be your firststring, anything in-between (which is the reason for which this output string is of interest), and your secondstring, or even more complicated search results. (As with any other variable, you can replace result by any other name except for "reserved" names.) Hence, simple regex search:

fileread, tb, C:\AHK\regexvicinitysearch.txt
pos := regexmatch(tb, "youronlysearchstring", result)
msgbox, %pos%`n%result%

_________________________

X near Y (just for basic explanation means)

Then you would like to find 2 strings within the vicinity of each other (either order, x then y or y then x), so this is an "OR" (by "|") search, since that's the subject of this thread.

We group the "OR" elements (yes, into groups) by parentheses "()", and done:
pos := regexmatch(tb, "(yourfirststring.{0,30}yoursecondstring)|(yoursecondstring.{0,20}yourfirststring)")

NB: yourstrings are not variables in this expression, but just placeholders for your respective strings (so when you replace them with your actual strings, don't add any more "" here).

tb is our variable, so no "" here; instead, you could put in the full text to be searched here, enclosed in "" then; this is just theory though since even when this string is rather short, ie is some substring, you regularly will use a variable instead, since you will have read that substring into that variable to be searched further then.

.{n,m}: first the dot, which means "any character". After the dot comes a multiplier, which says how many occurrences are allowed of the immediately preceding element; that element is the dot here, but it can be any other element, including a group.
In AHK, the dot by default matches any character that is not part of a `r`n (CRLF) sequence, so in text with lone `n or `r line breaks it runs across them; options like `a exclude all kinds of line breaks, as we'll do later on, and the s option makes the dot match everything. For vicinity search, staying within one line is often wanted. (Remember that line breaks are much rarer than screen wraps.)

The multipliers: asterisk * is for "any number incl. 0", plus sign + is for "any number but at least 1", {n} is for "exactly n", {n,m} is for "at least n, at most m", for example {0,30}; similarly, {30,} is "at least 30" while {30} is "exactly 30". Write them without spaces and with an explicit lower bound: the PCRE engine which AHK uses takes {0, 30} or {,30} as literal text, not as a multiplier.
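
To try the multipliers without a file, here is a minimal sketch on an inline sample string (the sample text and the names pos1, r1 etc. are my assumptions for illustration, not part of the templates further down; AHK v1 syntax):

```autohotkey
; Assumed sample text: the gap between "alpha" and "beta" is 9 characters
tb := "alpha 1234567 beta"

pos1 := RegExMatch(tb, "alpha.{0,10}beta", r1)   ; a gap of 9 fits into 0..10
pos2 := RegExMatch(tb, "alpha.{0,5}beta", r2)    ; a gap of 9 exceeds 5, so no match
pos3 := RegExMatch(tb, "alpha.{9}beta", r3)      ; exactly 9

MsgBox, %pos1% %r1%`n%pos2%`n%pos3% %r3%
```

If the gap is counted correctly, the first and third searches should report position 1 together with the whole sample string, and the second one should report 0.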

NB: If you use variables instead of numbers (integers) for n and m (as we will do below), then in AHK these variables must be written outside of the substrings, i.e. outside of the "" parts, and the surrounding spaces are necessary too (see the code below), so the slightest typing error will break or falsify your regex. On the other hand, if you use templates, like we do below, then once they're set up correctly you won't touch them anymore, and they will stay reliable; that's the other big advantage of using variables instead of strings within regex expressions: programmatic reusability AND robustness.

Your whole firststring, with its character length, is "eaten" by the yourfirststring match, so its length doesn't count towards the length of the "anything in-between"; it does count of course towards the length of the total search result string (which is variable since the "in-between" is of variable length). So if your script does multiple searches, for multiple occurrences, these substring lengths (all 3 of them: first, in-between, second) become relevant for the correct starting positions of the subsequent search runs.

As you see from our example, you can use different distances for "x then y" and "y then x"; albeit not being certain of their usefulness, I've introduced them into my expressions, together with variable assignments which spare you from typing identical data multiple times, see below. Here is the code with just a minimum of variables; this is bad practice since you will probably damage your regex by typing your real values into it. The search starts from the beginning of the text by default; note that there are no global (), the () are just for grouping (2 OR groups here):

; X near Y (just for basic explanation means)

fileread, tb, C:\AHK\regexvicinitysearch.txt
pos := regexmatch(tb, "(yourfirststring.{0,30}yoursecondstring)|(yoursecondstring.{0,20}yourfirststring)", result)
msgbox, %pos%`n%result%

_________________________

1) X near Y

The same with extensive variable use, so that your regex expression stays unchanged. This is real ugly but presents the advantage that the expression will not change anymore, so you can easily and programmatically put any other values for the search terms and for the distances into it (the expression becomes a reusable template); you also see that of course, you can use different distances for either order.
NB: In the AHK expression, the spaces between the variables and the substrings are mandatory. (You could use dots instead, but then your expression would become really unreadable: dots for concatenation, and dots for "any char".) So this is the code for ONE search, i.e. one which will find the very first occurrence of y after x or x after y. (:= assigns the result of an expression, so strings go within ""; = assigns the literal text behind it and is used here for the numbers. All distances are the respective max distance we allow between the elements in question.)

; 1) X near Y

; your input instead of mine here:
tb := "C:\AHK\regexvicinitysearch.txt"
x := "yourstring1"
y := "yourstring2"
xy = 30
yx = 20

; but don't touch these anymore:
fileread, tb, %tb% ; that's for showing you how crazy AHK is:
; 1) the same variable name for (even totally different kinds (but not formats) of) input (path) and output (content) - my fun
; 2) two variables in one command, first one mandatorily without %%, second one mandatorily with %% - their fun
pos := regexmatch(tb, "(" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")", result, 1)
msgbox, %pos%`n%result%
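
One caveat with this template: x and y are pasted into the pattern as regex, so a search term containing characters like . ( ) or ? changes the pattern. A possible way around it, sketched here under my own assumptions (QuoteTerm is a hypothetical helper name, the sample text is invented, and StrReplace needs AHK v1.1.21 or later), is to wrap each term in PCRE's \Q...\E, which makes everything in between literal:

```autohotkey
; Assumed sample text with regex-special characters in the search terms
tb := "Price: 3.50 (net) ... total 4.20 (gross)"
x := QuoteTerm("3.50")
y := QuoteTerm("(gross)")
xy = 30
yx = 20

; the template itself is unchanged
pos := RegExMatch(tb, "(" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")", result, 1)
MsgBox, %pos%`n%result%
return

; Hypothetical helper: returns the term wrapped in \Q...\E.
; A \E inside the term would end the quoting early, so it is split:
; close the quote, add a literal \E (written \\E), reopen the quote.
QuoteTerm(term) {
    return "\Q" . StrReplace(term, "\E", "\E\\E\Q") . "\E"
}
```

Without the quoting, the dot in 3.50 would match any character and the parentheses around gross would be read as a group, so the search could hit text you never asked for.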


_________________________

2) X near (Y or Z)

Then, also of high practical value is the search template for "x AND (y OR z)"; you don't need permutations here since your "must" element will be x, and your "either" elements will be within the "OR" group; then you have:

x, then up to x-to-yz chars, then (either y or z)
OR
(either y or z), then up to yz-to-x chars, then x

Hence:
We need two outer groups: (mandatory element before either-element) OR (the other way round),
and each group contains 3 sub-elements: 1 single element, the "in-between", the either/or group in () (order irrelevant):
inner:
x ... (y|z) | (y|z) ... x
and with outer:
( x ... (y|z) ) | ( (y|z) ... x )
As you see, that's easy, albeit in regex it becomes almost unreadable, as we see later on.

x AND y (full string) is: "(" x ".{0," xy "}" y ")"
ditto with the ()-grouped alternative y/z and the new distance: "(" x ".{0," x_yz "}(" y "|" z "))"
and the other way round: "((" y "|" z ").{0," yz_x "}" x ")"
we combine them, the order is irrelevant, but we preserve this order, in order to show that the double (()) are not global but are constituted by partial ():
"(" x ".{0," x_yz "}(" y "|" z "))""((" y "|" z ").{0," yz_x "}" x ")"
we must delete the central "", and we must insert the central | (I don't speak of "replace... with..." since these "musts" are independent from each other):
"(" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")"

Combining the outer ones the other way round would have given: "((" y "|" z ").{0," yz_x "}" x ")|(" x ".{0," x_yz "}(" y "|" z "))" - as you can see, the (()) are misleading at first sight, but of course, technically, the variants are quasi-identical. In fact, they are not quite, since here, the outer OR search also begins with an inner OR search, so this search should be slower than the first variant advocated by me.

Hence we've got:

; 2) X near (Y or Z)

tb := "C:\AHK\regexvicinitysearch.txt"
x := "yourmuststring"
y := "youreitherstring1"
z := "youreitherstring2"
x_yz = 30
yz_x = 20

; don't touch these:
fileread, tb, %tb% ; of course, you should better use tf for textfile and tb for textbody, in any non-AHK...
pos := regexmatch(tb, "(" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")", result, 1)
msgbox, %pos%`n%result%


You must understand that whenever there is an OR, there is some precedence: a regex run reports ONE match only, namely the leftmost one in the text and, at that position, the first alternative (in the order written in the expression) that succeeds; then the current search ends, while you might have been interested in alternative matches, which will never come. For example, once a yourmuststring is found with one of the either-strings within the max distance x_yz behind it, the current search stops with that match, and since .{0,n} is greedy, the match stretches to the FARTHEST either-string within that distance, EVEN if the other either-string is nearer to yourmuststring. And since our regex searches in mixed orders, x before y-or-z and x after y-or-z, the final search results can be even more surprising.

So you must bear in mind that any valid search result, resulting from an earlier combination search of your current regex run, will be shown, to the detriment of any other search result by that same run, which also would have been valid after all, but which isn't searched for anymore. So inept or just unfortunate OR-combi searches can hide the results you're after in some cases; if you want to prevent this at all cost - most of the time, this is not necessary though, since in "X near (Y or Z)", the X is of interest, not one of the other two elements, and the X, the yourmuststring, will be found in any case -, you have to resolve your "X near (Y or Z)" search into 2 different searches "X near Y" and "X near Z", then will have to combine (and sort) the results, which is perfectly possible of course.
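
The resolution into two searches can be sketched as follows. The sample text, the single common distance d and the result variable names are my assumptions, chosen so that the "either" term cherry sits nearer to the "must" term apple than banana does:

```autohotkey
; Assumed sample text: z ("cherry") is nearer to x ("apple") than y ("banana")
tb := "apple .. cherry .... banana"
x := "apple"
y := "banana"
z := "cherry"
d = 30 ; one common max distance, for brevity

; combined search, same template as in 2) above
posAll := RegExMatch(tb, "(" x ".{0," d "}(" y "|" z "))|((" y "|" z ").{0," d "}" x ")", resAll, 1)

; the same search resolved into "X near Y" and "X near Z"
posY := RegExMatch(tb, "(" x ".{0," d "}" y ")|(" y ".{0," d "}" x ")", resY, 1)
posZ := RegExMatch(tb, "(" x ".{0," d "}" z ")|(" z ".{0," d "}" x ")", resZ, 1)

MsgBox, combined: %posAll% %resAll%`nX near Y: %posY% %resY%`nX near Z: %posZ% %resZ%
```

On this sample, the combined search should return the long string reaching up to banana, and only the separate "X near Z" search shows the shorter apple-to-cherry hit; sorting and merging the two result lists is then up to your script.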

____________________

3) X, Y and Z All Together

Since we've treated x AND (y or z), let's also do x AND y AND z, which is also of high practical value (they are in any which order here, so with 3 elements, we must search for 6 combinations, as we will see):

; X, Y and Z All Together

; here your input:
tb := "C:\AHK\regexvicinitysearch.txt"
x := "yourstring1"
y := "yourstring2"
z := "yourstring3"

Then your max distances again:
xy = 30
yx = 20
xz = 10
zx = 15
yz = 12
zy = 18

or you can simplify:
; more input:
m = 30 ; or any other value common for them all or some group only:
n = 40
xy := m
yx := m
xz := m
zx := n
yz := n
zy := n

we need the elements in any which of these orders:
xyz
xzy
yxz
yzx
zxy
zyx

; pairs (full strings):
; xy: "(" x ".{0," xy "}" y ")"
; yx: "(" y ".{0," yx "}" x ")"

; hence the triples:
xyz: "(" x ".{0," xy "}" y ".{0," yz "}" z ")"
xzy: "(" x ".{0," xz "}" z ".{0," zy "}" y ")"
yxz: "(" y ".{0," yx "}" x ".{0," xz "}" z ")"
yzx: "(" y ".{0," yz "}" z ".{0," zx "}" x ")"
zxy: "(" z ".{0," zx "}" x ".{0," xy "}" y ")"
zyx: "(" z ".{0," zy "}" y ".{0," yx "}" x ")"

all 6 triples in a row (the order is irrelevant):
"(" x ".{0," xy "}" y ".{0," yz "}" z ")""(" x ".{0," xz "}" z ".{0," zy "}" y ")""(" y ".{0," yx "}" x ".{0," xz "}" z ")""(" y ".{0," yz "}" z ".{0," zx "}" x ")""(" z ".{0," zx "}" x ".{0," xy "}" y ")""(" z ".{0," zy "}" y ".{0," yx "}" x ")"

ditto but the (now faulty) "" deleted, and ORs inserted:
"(" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")"

and now the complete expression / end of the script:
; don't touch these:
fileread, tb, %tb%
pos := regexmatch(tb, "(" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")", result, 1)
msgbox, %pos%`n%result%
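
If typing the six triples by hand feels too error-prone, the same expression can be assembled from a small function. This is only a sketch: Triple is a hypothetical helper name of mine, the sample text is invented, and the distances are the example values from above (AHK v1 syntax; lines beginning with a dot continue the expression of the line before):

```autohotkey
x := "yourstring1"
y := "yourstring2"
z := "yourstring3"
xy = 30
yx = 20
xz = 10
zx = 15
yz = 12
zy = 18
tb := "yourstring2 ... yourstring1 . yourstring3" ; assumed sample text (order y, x, z)

; the same six triples as above, joined by OR
pattern := Triple(x, xy, y, yz, z)
    . "|" . Triple(x, xz, z, zy, y)
    . "|" . Triple(y, yx, x, xz, z)
    . "|" . Triple(y, yz, z, zx, x)
    . "|" . Triple(z, zx, x, xy, y)
    . "|" . Triple(z, zy, y, yx, x)

pos := RegExMatch(tb, pattern, result, 1)
MsgBox, %pos%`n%result%
return

; Hypothetical helper: one ordered triple "a, gap, b, gap, c" as a () group
Triple(a, ab, b, bc, c) {
    return "(" . a . ".{0," . ab . "}" . b . ".{0," . bc . "}" . c . ")"
}
```

The function uses concatenation dots throughout, which is less readable inside the function but keeps the six calls easy to check against the list of orders.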

_________________________

4) Retrieve the whole line

(This implies that technically, the whole line of your "real" match, the one you're after, will become the new match.)

The templates above search for terms in the vicinity of each other, but their only output is the string from the first character of the first matched term up to the last character of the last matched term; anything before or behind that "sufficient" match string is left out (see my final remarks for "2) X near (Y or Z)" above). In many cases, though, you will want to retrieve the whole line of the match (for "context"). Fortunately, this is very easy. (^ stands for line/text begin, $ stands for line/text end, and the ? after the * makes the * non-greedy; sometimes it's not really needed, but see my Dottie Killer thread for its usefulness.)

So we need the option "search line by line", which is m) in AHK. A warning, though: AHK's traditional m) option alone will often not work for line-by-line since by default it only recognizes complete CRLFs as line breaks, so we add the `a option, too, which recognizes any kind of line break (attention, there is a backtick, i.e. an accent grave, before the a; for the position see the templates). This cannot harm except in very special cases, i.e. whenever you distinguish lines from paragraphs comprising several lines.

Then we search for a global string comprised of these 3 groups:

1. (line-begin and anything-or-nothing)

2. (our complete, original search string from above, adjusted at its start and its end: its start and end are now merged into bigger substrings, together with 1. and 3. respectively, so we delete the (now wrong) start-" and end-" and put "(" and ")" in their place; the original search string, no longer "alone" but having become the inner group of 3, must be enclosed in () for its inner logic to be upheld. Unnecessary groupings do no harm, but missing groupings do.)

3. (anything-or-nothing up to the line-end)

I present the templates in the form:

; the title
; the complete, original search string
the complete new expression; it's with this line obviously that you will have to replace the original expression above, everything else in the original templates remaining valid
in other words, we replace the first " of the original search strings by pos := regexmatch(tb, "m`a).*?(
and we replace the very last " of the original search string by ).*?$", result, 1)
Hence:

; 1) X near Y
; "(" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")"
pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")).*?$", result, 1)

; 2) X near (Y or Z)
; "(" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")"
pos := regexmatch(tb, "m`a).*?((" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")).*?$", result, 1)

; 3) X, Y and Z All Together
; "(" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")"
pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")).*?$", result, 1)

Now the match and output is the whole line/paragraph, but containing whatever you're really after, together with the complete-line "context"; you could do variants which only show a certain number of chars before, and a certain number of chars behind the part which really interests you, and you also can put bold-codes or similar around the part which is of real interest to you; for both variants, my final remarks for "2) X and (Y or Z)" apply again.
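
For a quick test of the whole-line template without a file, here is a minimal sketch with an inline three-line text (the sample text and search terms are my assumptions; `n inside the quotes is a line break in AHK v1):

```autohotkey
; Assumed sample text: only the second line contains both terms
tb := "first line without hits`nsome text foo and then bar here`nlast line"
x := "foo"
y := "bar"
xy = 30
yx = 20

; template 1) X near Y, whole-line variant
pos := RegExMatch(tb, "m`a).*?((" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")).*?$", result, 1)
MsgBox, %pos%`n%result%
```

This should show the position where the second line starts, followed by that whole line, not just the part from foo to bar.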

_________________________

5) Retrieve all matching lines

(As before, technically, the whole lines of the "real" matches will now be matches.)

In most real-life use cases, you will want to create a "hit table", a list of all the results. Here, we just display that result list, but in practice, most of the time, you would retrieve the output variable for further processing; here, we use the clipboard as output variable, but any other variable will do if further processing will be done within the same script language or if the next, "overtaking" language can retrieve your output variable (which is only sometimes the case).

So what do we need? Regex can only do one search, ie there is no process logic built into this language; as we have seen, it can write the "matches" into an output variable (even into an array, every group being one array element, but we don't need this functionality here).

So we must introduce process logic by our (any) scripting (or programming) language (here AHK, or C#, Python...) and create another file from the outputs of the multiple regex runs. We'll do this now.

Also, we always started our (single) search at character position 1 of our text. Now, with consecutive searches, after every search result we must determine at what position our next search will start, so as to avoid "finding", and then writing, the same, very first match again and again into our output. Thus, we start each next search at the end of our previous one, but so that it can still match the very next line of the text if that line is another match (since the line breaks themselves are never part of a match, sp := pos + ml will be ok). Btw, we write into the target file only at the very end: for speed reasons again, on every match we append the match to a variable, and when the list is complete, we write that list to our output file (alternative final output below).

We take our example 1 (X near Y) here, all other examples accordingly:

; first (ie the input) part of the script
; your input instead of mine here:
tb := "C:\AHK\regexvicinitysearch.txt"
x := "yourstring1"
y := "yourstring2"
xy = 30
yx = 20

; and so on, for your input, see numbers 1 to 3 above


; don't touch the rest of the script, except for outcommenting the TWO pos := ... lines which do NOT apply for your case, and for deleting the comment-code of the line you WANT to be executed, of course:

; Second (ie not: input) part of the script:
fileread, tb, %tb%
clipboard = ; initialisation of our output variable (is emptied now)
; target = ; alternatively, initialisation of some dedicated output variable, leaves your clipboard alone
mn = 0 ; initialisation of our var mn which stands in for matchnumber
ml = 0 ; initialisation of var ml which stands in for matchlength
sp = 1 ; initialisation of var sp which stands in for startposition (remember: 1 is default, here needed for the very first run)

LOOP ; As said, we need multiple regex runs (so we need a loop), up to the point where no more matches are found (so then the loop ends):
{ ; begin of loop
; the following steps are repeated with every loop run (iteration):

; at any time, just delete the out-comment code for ONE of the three regex expressions:
; 1) X near Y:
; pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")).*?$", result, sp)
; 2) X near (Y or Z):
; pos := regexmatch(tb, "m`a).*?((" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")).*?$", result, sp)
; 3) X, Y and Z All Together:
; pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")).*?$", result, sp)

if pos = 0 ; no (more) match for the current (!) run/iteration:
{
   break ; break this loop and continue the script below the loop
   ; break breaks/leaves the construct, return breaks/leaves the script
   ; as you can see, I FIRST check for the possible fail, which then breaks the loop anyway,
   ; so this spares me an unnecessary if - else construct; I do this whenever possible
}   ; end of the if construct; the braces are NOT necessary, since IN the if construct, there is only ONE line here (comments don't count)

; if we have arrived here, pos is NOT 0, so it's a (or another) match, and we will process it now:
++mn ; success, so we increment mn (matchnumber) by 1

ml := strlen(result) ; the length (!) of our match string (which is the whole match line, remember)
; startposition (sp) for the searches is 1 for the very first one (initialisation before (!) the loop),
; but for any other run/iteration, we retrieve, here, AFTER the match, the sp value needed for the NEXT run/iteration:
sp := pos + ml ; this is not concatenation (which would also be possible for numbers as digits) but addition of course

; and the match string again: we could put a msgbox here, as we did in our examples above, but instead we write the result to our output var:
clipboard .= "`n" . result
; clipboard is a variable "clipboard" and the real clipboard at the same time
; somevar .= someothervar is short for: somevar := somevar . someothervar, ie spares the repetition of the first var name
; or with another variable than the clipboard: target := target . "`n" . result or shorter: target .= "`n" . result
; "`n" is a linebreak (string), an alternative would be "`r`n" (at your convenience)
; the dots are concatenation dots, other languages use + or & or other signs for combining elements
; result is the output variable of the current regex run
; := or .= is variable assignment, hence strings are within "", variables are not
; so the operation is: we ADD a linebreak and the content of "result" TO our existing target var (which is (the) clipboard here)
; since for the very first successful run of the loop, the target var is empty, our very first "`n" is unwanted and will be deleted later
} ; END OF LOOP

if mn = 0 ; matchnumber 0 means: no match at all (ie not only in the current run anymore)
{
   ; no matches means the target var is as empty as it had been before
   msgbox, Not found. Script ends here.
   return ; this breaks the script, so as above, no if - else construct necessary for the rest of the script
} ; more than just one line IN the if construct, so the braces are needed here

; else: there has been at least one match indeed, perhaps more than one, so we show the hit or hit list:
stringtrimleft, clipboard, clipboard, 1 ; but we first delete the very first `n before the first element of the list
msgbox, %clipboard%
;
; alternatively, you would read your variable into some other application or process it further here
;
; alternatively, you would write your variable into a target FILE:
; filedelete, C:\path\yourtargetfile.txt ; no "write to new file" command in AHK, hence deletion of any eponymous file
; fileappend, %clipboard%, C:\path\yourtargetfile.txt ; and then new creation of the file and "appending" of the var content to it
; you see here that for really appending the var content to the existing content of an existing file, you would just leave out the filedelete line

return


Now the second part of the script again, without the explanations:

; Second part, just select ONE of the regex expressions:

fileread, tb, %tb%
clipboard =
mn = 0 ; (matchnumber)
ml = 0 ; (matchlength)
sp = 1 ; (startposition)

LOOP
{
; at any time, just delete the out-comment code for ONE of the three regex expressions:
; 1) X near Y:
; pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")).*?$", result, sp)
; 2) X near (Y or Z):
; pos := regexmatch(tb, "m`a).*?((" x ".{0," x_yz "}(" y "|" z "))|((" y "|" z ").{0," yz_x "}" x ")).*?$", result, sp)
; 3) X, Y and Z All Together:
; pos := regexmatch(tb, "m`a).*?((" x ".{0," xy "}" y ".{0," yz "}" z ")|(" x ".{0," xz "}" z ".{0," zy "}" y ")|(" y ".{0," yx "}" x ".{0," xz "}" z ")|(" y ".{0," yz "}" z ".{0," zx "}" x ")|(" z ".{0," zx "}" x ".{0," xy "}" y ")|(" z ".{0," zy "}" y ".{0," yx "}" x ")).*?$", result, sp)

if pos = 0
break ; the loop

++mn ; matchnumber plus 1
ml := strlen(result)
sp := pos + ml ; for next iteration

clipboard .= "`n" . result
} ; END OF LOOP

if mn = 0
{
msgbox, Not found. Script ends here.
return
}

stringtrimleft, clipboard, clipboard, 1 ; the very first `n
msgbox, %clipboard% ; or alternatively:
; filedelete, C:\path\yourtargetfile.txt
; fileappend, %clipboard%, C:\path\yourtargetfile.txt

return
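
To try the loop without any file and without touching your clipboard, here is a self-contained sketch of the same second part. The inline sample text is my assumption; target is the dedicated output variable mentioned in the comments above, and only template 1) is used:

```autohotkey
; Assumed sample text instead of a file: lines 1 and 3 contain both terms
tb := "foo then bar`nnothing here`nbar before foo`nfoo alone"
x := "foo"
y := "bar"
xy = 30
yx = 20

target = ; dedicated output variable instead of the clipboard
mn = 0 ; (matchnumber)
sp = 1 ; (startposition)

Loop
{
    pos := RegExMatch(tb, "m`a).*?((" x ".{0," xy "}" y ")|(" y ".{0," yx "}" x ")).*?$", result, sp)
    if pos = 0
        break ; no (more) match
    ++mn
    sp := pos + StrLen(result) ; the next run starts right behind this line
    target .= "`n" . result
}

StringTrimLeft, target, target, 1 ; the very first `n
MsgBox, %mn% matching line(s):`n%target%
```

With this sample, the list should contain the first and the third line only; the last line has foo but no bar, so it stays out.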

Edit: formatting only
« Last Edit: August 02, 2017, 04:53 PM by ital2 »

cranioscopical

Thanks for taking the time!

I lost my mind quite some time ago but this helped me to retrieve a small part of it.


Contro

Guauuuuuuu !!!!!!!!!

I think this may be useful for this post !

https://www.donation...ex.php?topic=44158.0
 :-*

ital2

Very impertinent thread by kalos: https://www.donation...ex.php?topic=45945.0 considering that his problem over there is more or less explained here, whilst he pretends this thread doesn't exist.

One thing I probably should have added above, lookarounds; they are simpler than what I did above, but are also less powerful in the above use case, and you will need to know exactly what they are able to do in your specific regex flavor; just with Goyvaerts and .NET, lookarounds really excel (and can become complicated then).

Since the only problem for "regular" lookarounds is to memorize how to write them, I did myself a favor and wrote it down, once for all; here it is:


matchtext
wanted(text) (which is not included in the match)
notwanted(text) (which is not included in the match)

lookbehind/lookahead: (?...) ; does NOT count as a capturing group
then < for behind (memorize it as a backarrow) and nothing for ahead
then = for wanted and ! for notwanted

hence:

lookbehind:
(?<=wanted)matchtext
(?<!notwanted)matchtext (as before)
ATTN: .NET/JGSoft (Goyvaerts): any regex is allowed inside a lookbehind; Tcl: no lookbehind at all; Perl/Python/etc.: only fixed-length strings here, plus other limitations (alternation only with alternatives of the same length, no quantifiers...)

lookahead:
matchtext(?=wanted)
matchtext(?!notwanted) (as before)
ATTN: any regex is allowed inside a lookahead (unlike inside a lookbehind), even capturing groups (which will capture as normal, except in Tcl)
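
The notation above can be tried in AHK like this; it's a minimal sketch on an invented sample string (text and variable names are my assumptions). Note the \b in the third search: without it, the engine would simply give back one digit and report a shorter number in front of " USD":

```autohotkey
tb := "price: 100 USD, weight: 250 kg" ; assumed sample text

; positive lookbehind: digits directly preceded by "weight: "
pos1 := RegExMatch(tb, "(?<=weight: )\d+", r1)

; positive lookahead: digits directly followed by " USD"
pos2 := RegExMatch(tb, "\d+(?= USD)", r2)

; negative lookahead: a whole number NOT followed by " USD"
pos3 := RegExMatch(tb, "\b\d+\b(?! USD)", r3)

MsgBox, %pos1% %r1%`n%pos2% %r2%`n%pos3% %r3%
```

The results should be 250, 100 and 250 respectively; in each case only the digits are in the match, the wanted or notwanted text is not.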


I don't have the slightest idea how processing a 25-giga file would work, for AHK or other script languages; don't know about their respective memory management; perhaps you need a 32-giga work memory (I only have 16, and no file of that size to try for you).

Btw, that expensive TextPipe, etc. doesn't do anything else than combine regex with scripting, but I'm quite sure I'd soon get to its prefabrication limitations, so I'm happy I made the effort to delve into regex basics, instead of buying that, expensive, tool (or Goyvaert's, which at around 150 bucks is much cheaper).

If you know Perl instead of AHK, the above examples will even be much simpler, since, admittedly, AHK's string processing is a nightmare, for all the `s, "s and 's.

4wd

Very impertinent thread by kalos: https://www.donation...ex.php?topic=45945.0 considering that his problem over there is more or less explained here, whilst he pretends this thread doesn't exist.

Not really, it's been proven that people very rarely bother to use a forum search engine prior to just firing off a new thread.

Plus the search engine here is pretty basic, (probably with most forums), it's usually better to use Google ... of course that doesn't help if you're not, shall we say, proficient in reducing your search terms to the relevant.

Look Ahead/Behind is what I used in my last few posts of that thread.

IainB

Re: RegEx Vicinity Search / Regular Expressions: Search for Knowledge.
« Reply #5 on: October 18, 2018, 12:00 AM »
...Not really, it's been proven that people very rarely bother to use a forum search engine prior to just firing off a new thread. ...
I was unaware that this was "proven", but it certainly looks that way - judging from what I've seen, anyway. However, even if you used (say) a site: search (as opposed to the cruddy internal search tool), finding and consolidating relevant/related material in the discussion threads is still likely to be an uphill battle and somewhat hit-or-miss.

DCF is often a veritable mine of useful information, with stuff to be found on various subject categories in DCF discussion threads, but a lot of it seems to be buried in or scattered across threads broken into multiple micro-sub-categories. Occasionally, I try to pull these bits and pieces together into specific higher-level category threads to provide a sort of indexed experiential knowledge-point on a specific subject category that I am interested in. The trouble there is that I am the sole author/editor of the index I created, and - as things stand - it can't be edited in a shared or collaborative fashion by other DCF members. This is a spotty, unreliable and inefficient way of accumulating/curating a knowledge base category.
Ideally, we would use a Wiki for those...    :o

ital2

EDIT OF THE ABOVE:

matchtext
wanted(text) (which is not included in the match)
notwanted(text) (which is not included in the match)


BETTER:

You have your:

- matchstring
i.e. the string (IF it matches of course, and that's to be checked first by the regex engine), to be declared a match IF the lookbehind/lookahead condition(s) is/are met resp. NOT met

- wantedstring(s)
i.e. the string(s) (which are not included in the match)  and which MUST be there (i.e. before resp. after the matchstring), in order to declare the (matching) matchstring a match

- unwantedstring(s)
i.e. as before but if these strings are matched there, your (originally matching) matchstring match is declared UN-successful by the regex engine (i.e. the unwantedstring "must not" be there, in the sense of "is not allowed to be there"; I insist on this fact since in some other languages, "must not" is synonym for "may be there or not, "is optional", and that's NOT the case here)

Also, let's recall that there can be a negative or positive lookbehind and ALSO a negative or positive lookahead in the same expression, if the regex engine in question doesn't stumble upon such combinations; and that of course, the 3 different strings have nothing to do with each other. Conceptually (a real engine works through the pattern from left to right, so it checks a leading lookbehind first, but the outcome is the same), the engine
- tries to match the matchstring
- if that's successful, it checks the lookbehind
- if that's successful AND does not discard the match (i.e. in case of a negative lookbehind), it checks the lookahead
- if that's successful too (and does not discard the match, i.e. in case of a negative lookahead), the match is deemed successful, and it will consist ONLY of the matchstring. That's the reason why the lookbehind/lookahead parentheses () are NOT counted as groups for replacements: after having served for validating or invalidating the matchstring match, they go back into oblivion, as far as regex is concerned.

I'm insisting on these facts since, in view of lookbehinds/lookaheads being extremely simple, the obvious difficulty people have with them must lie with misconceptions they could have, around them.

/END OF EDIT



EDIT: "!" is the logical "not", so it's logical that this character is used in negative lookarounds; the "?", in regex in general, stands for "0 or 1 (!) occurrences of the preceding (!) element"; for distinguishing simple linebreaks from double / multiple ones, you'll use a complete lookaround, e.g. in "replace (?<!\n)\n(?!\n) by \n\n" which would only replace the single ones by two, but would not multiply the double or multiple ones; this is of interest e.g. for normalizing web downloads where the title for the next paragraph often clings to the previous one; of course you could implement an additional condition of a max line length, in order for that single \n to be (matched >) changed within that given line character number limit, in order for the code not to affect (most) regular sub-paragraph breaks within regular, broader paragraphs of the source material.

(EDIT: Don't forget the lookbehind-"<": it's necessary since in real life, there often would be other string parts, even before the lookbehind, needing disambiguation.)
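
The line break replacement described above could look like this in AHK. It's a minimal sketch which assumes text with LF-only line breaks (for CRLF text the pattern would need adjusting); the sample text and variable names are invented:

```autohotkey
; Assumed sample: a single line break after "Title", a double one before "Next paragraph"
text := "Title`nBody line one.`n`nNext paragraph"

; replace only those \n which have no \n directly before and none directly behind
fixed := RegExReplace(text, "(?<!\n)\n(?!\n)", "`n`n")

MsgBox, %fixed%
```

The single break after Title should come out doubled, while the existing double break stays as it is, since each of its two \n has another \n as neighbor.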
« Last Edit: December 01, 2018, 12:37 PM by ital2 »

ital2

"Ideally, we would use a Wiki for those...". Not.