Topic: Extract REGEX matches from multiple text files

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Extract REGEX matches from multiple text files
« on: August 03, 2018, 04:14 AM »
Hello!

Which tool or scripting language can I use to match REGEX strings in multiple text files and either extract the matches to a new file or just delete the non-matching strings?

It would help if it were easy to write, as I cannot learn complicated syntax!

Also, ideally it should work from the command line, as I am talking about a great many files, which can be huge.

Thanks!

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
« Reply #1 on: August 03, 2018, 05:49 AM »
Code: PowerShell
$outfile = 'K:\output.txt'
$regex = '^Function.+'
$items = Get-ChildItem -Path *.ps1       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}
« Last Edit: August 03, 2018, 06:05 AM by 4wd »

kalos

« Reply #2 on: August 03, 2018, 07:04 AM »
Thanks, but I am not familiar with this language.

I think AutoHotkey would be more appropriate for me as it has a simple structure (what are the $i = 0; $i -lt $items.Count; $i++? They look like Aramaic to me :P)

Can you tell me the commands/structure in AHK to do this:

search for regex1, extract regex2 from within the regex1 match (i.e. append it to a new file), and keep looping until no further regex1 match is found

Also, how do I specify an exact string in a regex? I want to match the string <dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">
and I don't want to escape every single symbol.
Is there a way to search for an exact string literally?
« Last Edit: August 03, 2018, 07:15 AM by kalos »

4wd

« Reply #3 on: August 03, 2018, 07:14 AM »
Quote from: kalos
what are the $i = 0; $i -lt $items.Count; $i++

i = 0
While i < (number of items)
  blah blah blah
  i = i + 1
Wend

EDIT: It's actually a For ... Next loop, but the result is the same.

For i = 0 to (number of items - 1) step 1
  blah blah blah
Next
« Last Edit: August 05, 2018, 05:35 AM by 4wd »
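
For comparison, here is a minimal sketch of the same loop written two ways in PowerShell. The file names are hypothetical placeholders standing in for whatever Get-ChildItem would return, and the variable names $viaFor and $viaForeach are made up for this illustration.

```powershell
# Hypothetical stand-in for the file list; the posted script gets it from Get-ChildItem.
$items = @('firstfile.txt', 'secondfile.txt', 'thirdfile.txt')

# Index-based loop, as in the posted script:
# start at 0, continue while $i is less than the item count, add 1 each pass.
$viaFor = @()
for ($i = 0; $i -lt $items.Count; $i++) {
    $viaFor += $items[$i]
}

# Equivalent foreach loop: visits the same items without a counter.
$viaForeach = @()
foreach ($item in $items) {
    $viaForeach += $item
}
```

Both loops visit the items in the same order; the foreach form simply hides the counter.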

kalos

« Reply #4 on: August 03, 2018, 07:25 AM »
Also, the file I want to manipulate is 25GB! Is there a strategy to handle this?

4wd

« Reply #5 on: August 03, 2018, 07:26 AM »
Code: PowerShell
$outfile = 'K:\output.txt'
$regex = '<dsf:tsdfgd trsdfge=\"urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d\" id=\"OsdfgsdfD\">'
$items = Get-ChildItem -Path *.txt       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}
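
On the question of matching an exact string without escaping each symbol by hand, two approaches are sketched below: the -SimpleMatch switch of Select-String, and [regex]::Escape. The sample file name and its contents are hypothetical and exist only so the sketch can run by itself.

```powershell
# The literal text from the question; single quotes keep PowerShell from expanding anything inside.
$literal = '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">'

# Hypothetical sample file so the sketch is self-contained.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'literal-demo.txt'
Set-Content -Path $sample -Value @("before $literal after", 'no match here')

# Option 1: -SimpleMatch treats the pattern as plain text, not as a regex.
# It reports the matching lines.
$simpleHits = @(Select-String -Path $sample -Pattern $literal -SimpleMatch)

# Option 2: escape the text once, then use the result as an ordinary regex.
$escaped   = [regex]::Escape($literal)
$regexHits = @(Select-String -Path $sample -Pattern $escaped -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value })

Remove-Item $sample
```

Option 2 is the one to use when the literal text is only part of a larger pattern, since the escaped string can be concatenated with other regex fragments.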

kalos

« Reply #6 on: August 03, 2018, 10:52 AM »
Could you please tell me how to do this in AHK? I am not familiar with PowerShell, unless you can point me to explanations of these commands.

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
« Reply #7 on: August 03, 2018, 12:16 PM »
The scripting language used is Microsoft's PowerShell, the intended successor to cmd and its relatively limited batch (.bat/.cmd) scripting. It comes installed as standard with Win10, Win8.1 and Win8, and can easily be installed on older Windows versions.

Copy the script to a file with a .ps1 extension, change the 1st line to your desired results file, and change *.txt in the 3rd line to the extension of your data files. Then press the Start button, start typing powershell to find it, and run the script from the directory where your data files are.
Largish files are no issue for PowerShell.
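
For anyone who wants the one-line-at-a-time behaviour to be explicit, the following is a minimal sketch using the .NET method [System.IO.File]::ReadLines, which hands back lines lazily rather than loading the whole file. The function name Export-RegexMatch and the demo file names are hypothetical; this is an alternative to the posted script, not a description of how Select-String works internally.

```powershell
function Export-RegexMatch {
    param(
        [string]$Path,      # input file (use a full path; .NET ignores the PowerShell location)
        [string]$Pattern,   # regular expression to look for
        [string]$OutFile    # output file (full path), opened in append mode
    )
    # Note: [regex] is case-sensitive by default, unlike Select-String.
    $regex  = [regex]$Pattern
    $writer = [System.IO.StreamWriter]::new($OutFile, $true)   # $true = append
    try {
        # ReadLines yields one line at a time, so memory use stays small.
        foreach ($line in [System.IO.File]::ReadLines($Path)) {
            foreach ($m in $regex.Matches($line)) {
                $writer.WriteLine($m.Value)
            }
        }
    }
    finally {
        $writer.Dispose()
    }
}

# Small demo on hypothetical temp files.
$in  = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-demo-in.txt'
$out = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-demo-out.txt'
Set-Content -Path $in -Value @('Function One', 'not this line', 'Function Two')
if (Test-Path $out) { Remove-Item $out }
Export-RegexMatch -Path $in -Pattern '^Function.+' -OutFile $out
$result = @(Get-Content -Path $out)
Remove-Item $in, $out
```

Keeping a single writer open for the whole run also avoids reopening the output file for every match, which is what repeated >> redirection does.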

kalos

« Reply #8 on: August 03, 2018, 02:37 PM »
Very interesting!

Do you know a good site that explains the structure of the script you posted and the definition/usage of the commands along with examples?

Also, does this script load the whole text of the file in memory to perform its operations? This will be a problem for a 25GB file

kalos

« Reply #9 on: August 03, 2018, 05:15 PM »
Can you explain please word by word this bit:

$items = Get-ChildItem -Path *.txt       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}

Also, I need to append to the output file several regex matches/returns, how do I do that?
Also, if I specify a regex match, how do I specify what I want to be returned from this match?

wraith808

  • Supporting Member
  • Joined in 2006
  • Posts: 11,186
« Reply #10 on: August 03, 2018, 10:02 PM »
It's pretty standard PowerShell, and there are a lot of tutorials online that will explain the commands to you.  Just as a start, # is a line comment, so everything after that on the line is just documentation.

4wd

« Reply #11 on: August 03, 2018, 10:11 PM »
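$items - an arbitrarily named variable
=        - assignment operator: stores the result of the right-hand side in the variable
Get-ChildItem - lists the files matching the given path (here *.txt)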

Thus $items now equals an array of files in the current folder that match *.txt
$items[0] = firstfile.txt
$items[1] = secondfile.txt
etc
etc
etc

$items.Count  - total number of matching files found

for(){}   - a for loop, $i is a variable that gets incremented by 1 every loop until the total number of matching files is reached

Thus loop through all the files in the array performing the following on every file:

Select-String -Path $items[$i] -Pattern $regex -AllMatches

Search each file for matching RegEx pattern, get all matches.

| % { $_.Matches } | % { $_.Value } >> $outfile

RegEx matches are piped into a ForEach-Object loop (shorthand notation). For each regex match, pipe its value to the output file in append mode.

Don't actually need to escape the " in the RegEx either:
Code: PowerShell
$regex = '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">'
Will also work.

Same as the 6 lines above without assigned variables or a for loop:
Code: PowerShell
gci *.txt | % { sls $_.Name -Pattern '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">' -a | % { $_.Matches } | % { $_.Value } >> K:\out.txt }
« Last Edit: August 07, 2018, 09:06 PM by 4wd »
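
To make the pipeline stages described above easier to follow, here is a small sketch that stores each stage in its own variable. The sample file, its contents and the pattern id=\d+ are hypothetical and were chosen only so that one line contains two matches.

```powershell
# Hypothetical sample data: line 1 has two hits, line 2 none, line 3 one.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'pipeline-demo.txt'
Set-Content -Path $sample -Value @('id=12 id=34', 'nothing', 'id=56')

# Stage 1: Select-String returns one MatchInfo object per matching LINE.
$lines = @(Select-String -Path $sample -Pattern 'id=\d+' -AllMatches)

# Stage 2: each MatchInfo has a Matches collection, one entry per hit on that line.
$hits = @($lines | ForEach-Object { $_.Matches })

# Stage 3: keep only the matched text of each hit.
$values = @($hits | ForEach-Object { $_.Value })

Remove-Item $sample
```

Without -AllMatches, stage 2 would carry only the first hit of each matching line.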

Ath

« Reply #12 on: August 04, 2018, 07:03 AM »
Quote from: kalos
Also, I need to append to the output file several regex matches/returns, how do I do that?
Also, if I specify a regex match, how do I specify what I want to be returned from this match?

Why did you leave these rather important 'details' out in your original question?

4wd

« Reply #13 on: August 04, 2018, 07:37 AM »
Quote from: Ath
Why did you leave these rather important 'details' out in your original question?

Um .... because that's what he normally does ... it always takes at least a week, (sometimes longer or never), before all pertinent information is obtained ...

wraith808

« Reply #14 on: August 04, 2018, 09:59 AM »
Quote from: Ath
Why did you leave these rather important 'details' out in your original question?

Quote from: 4wd
Um .... because that's what he normally does ... it always takes at least a week, (sometimes longer or never), before all pertinent information is obtained ...

And you're a lot more patient than I, in regards to someone not wanting to do the due diligence after you've given them the solution.

4wd

« Reply #15 on: August 04, 2018, 11:54 AM »
Yeah, kind of overstepped my G.A.S. limit but it provided a little mental exercise.

Normally would have left it at my first post but I was a little bored ...

kalos

« Reply #16 on: August 04, 2018, 03:46 PM »
Thanks, but I struggle to follow. I find AHK much more straightforward. But how can I make it work with a 25GB file?

kalos

« Reply #17 on: August 04, 2018, 04:59 PM »

Ath

« Reply #18 on: August 05, 2018, 03:10 AM »
Quote from: kalos
In the meantime I will read https://www.itprotod...course-you-can-learn

I expect you can follow what's written there; it seems well suited to PowerShell n00bs.

Quote from: kalos
Thanks but I struggle to follow. I find AHK much more straight forward. But how can I make it work with a 25GB?

Please stop asking for an AHK solution. Those who have participated here so far aren't going to provide one, as a perfectly good solution has already been provided.

If you had tried the script on actual data, you wouldn't have asked again about the 'measly' 25 GB files; yes, of course it will take some time to process, but so does an 18.8 minute PowerShell crash course.
PowerShell is built on the foundation of .NET, so it knows how to handle files efficiently.

Quote from: Ath
Why did you leave these rather important 'details' out in your original question?

Quote from: 4wd
Um .... because that's what he normally does ... it always takes at least a week, (sometimes longer or never), before all pertinent information is obtained ...

I know, I know, I'm just trying to educate someone (again, but it doesn't seem to be picked up much), see my quote below...

kalos

« Reply #19 on: August 05, 2018, 05:46 AM »
Quote from: 4wd, Reply #11 (the line-by-line explanation of the script)

That is very helpful thanks!

From what I have understood, the script will first scan its own folder where it exists, for all the txt files present and process them one by one in an array. Actually I think I can skip that bit if it can process the whole 25GB txt file at once.

As for the actual regex matches, what I would actually like it to do is to:
- scan the source file for a regex(A)
- finding the first instance of regex(A), it would store it in a variable and search another regex(B) inside that variable.
- then I have a couple more regex matches that I need it to run against that variable, and output specific things from these regex matches inside the initial regex(A). By output I mean write sequentially, line by line, to an output file.
- then the loop will continue with the next regex(A) match inside the source file, and store it in a variable, and search for the same regex(B) etc matches inside that variable and output parts of those regex matches in the output file.

Sounds very basic and simple. Can you tell me what commands I need to write something like that please?

Ath

« Reply #20 on: August 05, 2018, 06:20 AM »
If you are searching for a regex within a regex, 'You Are Doing It Wrong' (T).

Your initial requirement was to find and extract content using a regex, but now you need parts of that match to be split out? That can be done using a single regex, grouping the stuff you need to split out.
And for this whole exercise to make any sense, where is the variable part of the data to find? When searching for explicit text(s), a count would suffice...
Please provide a complete example, with actual data (not an entire file!), clearly marking the stuff you need to extract, of what you want to achieve, not how you think it could/should be solved.
« Last Edit: August 05, 2018, 06:34 AM by Ath »
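
As a rough illustration of "a single regex, grouping the stuff you need to split out", the sketch below uses named capture groups. The input line and the pattern are invented for the example, since the real data format had not been posted at this point in the thread.

```powershell
# Hypothetical one-line record; not taken from the actual data.
$text = 'name=widget price=9.99'

# Named groups (?<name>...) and (?<price>...) mark the parts to pull out of the overall match.
$pattern = 'name=(?<name>\w+)\s+price=(?<price>[\d.]+)'

$m = [regex]::Match($text, $pattern)
if ($m.Success) {
    # Each group is read back by name from the same match; no second search is needed.
    $name  = $m.Groups['name'].Value
    $price = $m.Groups['price'].Value
    $row   = "$name; $price"
}
```

The same Groups property is available on the Match objects that Select-String exposes through its Matches collection.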

wraith808

« Reply #21 on: August 05, 2018, 09:35 AM »
Quote from: Ath, Reply #20

Sounds like the same problem that I face at work.  Except I get paid to deal with the frustration.

kalos

« Reply #22 on: August 05, 2018, 11:03 AM »
Quote from: Ath, Reply #20

Indeed, I now realised it!
I will try to provide an example in a bit.

kalos

« Reply #23 on: August 05, 2018, 11:09 AM »
The format of the data is like that (the only difference is that the data is multiline rather than single line as in this example):

prod1
blah
specs=a
blah
price=b
blah
prod2
blah
specs=c
blah
price=d
blah

So I want the output to be a csv like:
prod1; a; b
prod2; c; d

So I was thinking of first using a regex to capture, in a variable, the first area of the text that belongs to a prod, which here is the first six lines (I cannot use the number of lines to distinguish records, as it varies).
Then it would extract a and b from that variable by matching the specs and price regex 'within' prod1 variable, so that I can distinguish them from prod2.
And then loop to complete the conversion.

Hope this helps?

So my understanding is that I cannot search for a regex that will match "specs=.+?" or something because I won't be able to distinguish this for prod1, prod2, etc.
At the same time, I cannot match the regex "prod1.+specs=.+?" because I don't know the exact text for prod1 (it's an xml attribute that is called prodID, but the value can be anything).

Do you have any idea on how to process this?

Ath

« Reply #24 on: August 05, 2018, 11:58 AM »
This example is totally useless. >:(

Please extract 2 or 3 of those (complete) product records from your actual data file. Optionally replace confidential stuff (data, prices) with aaaaa, bbbbb, 1.23, etc., but leave the structure exactly as it is!
Then post that here.
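
Working only from the simplified sample in Reply #23, and not from the real file, a line-by-line approach could look roughly like the sketch below. It assumes that a record starts at a line beginning with "prod" and that the wanted values sit on lines beginning with "specs=" and "price="; those markers, and the function name Convert-RecordsToRows, are assumptions made for illustration and would need to be replaced once the actual record structure is known.

```powershell
function Convert-RecordsToRows {
    param([string[]]$Lines)

    $rows    = @()
    $current = $null
    foreach ($line in $Lines) {
        if ($line -match '^prod\S*$') {
            # A new record starts: emit the previous one first.
            if ($null -ne $current) {
                $rows += '{0}; {1}; {2}' -f $current.Id, $current.Specs, $current.Price
            }
            $current = @{ Id = $line; Specs = ''; Price = '' }
        }
        elseif (($null -ne $current) -and ($line -match '^specs=(.*)$')) {
            $current.Specs = $Matches[1]
        }
        elseif (($null -ne $current) -and ($line -match '^price=(.*)$')) {
            $current.Price = $Matches[1]
        }
        # Any other line ("blah") is ignored.
    }
    # Emit the last record, which has no following "prod" line to trigger it.
    if ($null -ne $current) {
        $rows += '{0}; {1}; {2}' -f $current.Id, $current.Specs, $current.Price
    }
    return $rows
}

# The sample data from Reply #23.
$sample = @('prod1', 'blah', 'specs=a', 'blah', 'price=b', 'blah',
            'prod2', 'blah', 'specs=c', 'blah', 'price=d', 'blah')
$rows = @(Convert-RecordsToRows -Lines $sample)
```

Because the values are attached to whichever record is currently open, no regex-within-a-regex is needed. This version collects all rows in memory, so for a very large file the same logic would have to read and write incrementally instead.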