Topic: Extract REGEX matches from multiple text files  (Read 52962 times)

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #50 on: August 07, 2018, 04:16 AM »
Last, how to find the next regex match in the file?
Add another Select-String line with the next RegEx.

No, I don't mean a different regex. I mean the same regex. The same regex may have multiple matches in one file. How do I make the script find the first instance, do stuff, then find the second instance, do stuff, etc.?

Also, how do I make . match a newline as well?

Also, what is | % { $_.Matches } | % { $_.Value } >> $outfile exactly?
I don't know what % and { $_. and Value are?

Also, how do I return a specific part from regex? In normal regex text editors, you put the part in parentheses and then you replace them with \1 etc. How do I do it in Powershell?

Also, I should be able to figure this out myself, but I am looking for neat code and can only come up with messy stuff: is there a script to delete lines that do not contain specific literal phrases, e.g. lines not containing 'lue } >> $ou', without having to go through each character to check whether it needs escaping?
« Last Edit: August 07, 2018, 07:29 AM by kalos »

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #51 on: August 07, 2018, 08:50 AM »
You obviously don't understand what we wrote earlier. I already gave 2 (two) possible ways how to handle that. And there are other solutions too.

Like 4wd, I'll pick up again after you start giving complete examples of real pieces of a data file with exact descriptions of what you want to achieve, not descriptions of how you think it should be handled/solved.

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #52 on: August 07, 2018, 09:01 AM »
I want answers to specific questions, not a complete solution. It is not possible for anyone external to offer a complete solution because the source data cannot be shared.
Also, most of my questions are for my own understanding and may not directly relate to the specific problem.

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #53 on: August 07, 2018, 09:29 AM »
most of my questions are for my own understanding and may not directly relate to the specific problem.
You obviously don't understand what we wrote earlier. I already gave 2 (two) possible ways how to handle that. And there are other solutions too.
To spell it out, again:
  • Use a programming or script language
  • Use sed
  • And a third solution to add to the confusion: Use awk

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #54 on: August 07, 2018, 10:05 PM »
It is not possible for anyone external to offer a complete solution because the source data cannot be shared.

And there you go, a perfect illustration of the main problem.

Despite asking for raw data several times starting from page 1, (and getting various snippets and interpretations), we find out on page 3 that the data can't be shared.

Any reason why you couldn't have told us this when it was first asked?

Also, what is | % { $_.Matches } | % { $_.Value } >> $outfile exactly?

|
A Pipe

%
ForEach-Object

$_
Object passed through pipe.

Matches
A Property of the object.

Value
A Property of the object.

You do know that they have this thing called Google, right?

Let's say I want to extract the numbers in the fields ui_mode etc. for each of these three separate records.

Code: PowerShell
gci *.txt | % { sls $_.Name -Pattern '^.*"ui_mode",(\d+).*$' -a | % { $_.Matches } | % { $_.Groups[1].Value } >> K:\out.txt }
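
For anyone following along, the same pipeline can be written out with full cmdlet names instead of aliases. This is only a sketch: the temp folder, the sample file and its contents are made up for illustration, and the output goes to a hypothetical out.log instead of K:\out.txt.

```powershell
# Made-up sample data in a temp folder, so the sketch can run on its own.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'sls-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir 'sample.txt') -Value @(
    '"name","ui_mode",3,"other"'
    '"name","ui_mode",17,"other"'
    'a line with no match'
)
$outfile = Join-Path $dir 'out.log'   # not *.txt, so it is never read as input
if (Test-Path $outfile) { Remove-Item $outfile }

$values = Get-ChildItem -Path $dir -Filter *.txt |   # gci *.txt
    ForEach-Object {                                  # % : runs once per file, $_ is that file
        Select-String -Path $_.FullName -Pattern '^.*"ui_mode",(\d+).*$' -AllMatches |  # sls ... -a
            ForEach-Object { $_.Matches } |           # each matching line -> its Match objects
            ForEach-Object { $_.Groups[1].Value }     # each Match -> the text captured by (\d+)
    }

$values | Add-Content -Path $outfile                  # appends, much like >> does
$values
```

Name, Matches, Groups and Value are property names defined by PowerShell and .NET; only $dir, $outfile and $values are names chosen here.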

Patience_Meter_Bar.jpg
« Last Edit: August 07, 2018, 10:26 PM by 4wd »

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #55 on: August 08, 2018, 08:59 AM »
gci *.txt | % { sls $_.Name -Pattern '^.*"ui_mode",(\d+).*$' -a | % { $_.Matches } | % { $_.Groups[1].Value } >> K:\out.txt }

I still struggle very much with this and Google does not help. The main source of confusion I believe is the fact that I don't know if a term in the command is a random variable name or if it is a specific variable which is part of PS core. Also, another thing is that I may be able to Google "what does % mean in powershell" to find out what a specific symbol means but if they are part of another word like $_.Matches, it becomes confusing.

So the gci command will grab all the files that match *.txt in the directory.
The pipe means that we run another command sequentially.
The sls command means that it matches the regex inside the input $_.Name. What is that? I read "The $_ variable holds a reference to the current item being processed." But I don't understand what that means. Any idea?
Then, we move on to the command % which is basically a for-each loop. So for each of the Matches found with the previous command, we run another for each loop, the Groups[1].Value. I don't know what that is either. Any idea?
Finally we append to out.txt.

Are the terms Name, Matches, Groups, Value something standard in Powershell, or are they random variable names?

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #56 on: August 08, 2018, 01:39 PM »
Name, Matches, Groups, Value something standard in Powershell, or are they random variable names?
4wd did a fine job of explaining all the special chars / shortcuts he used in the post just above yours.
Name is the standard attribute holding, wait for it..., the name of the object.
Matches, Groups and Value come from .NET (the base of Powershell, as I explained earlier): Matches is a property of the MatchInfo object that Select-String outputs and holds the Match objects found on that line; Groups is a property of each Match, holding all groups found in the regex provided, with Groups[0] holding the entire match and Groups[1] etc. reflecting the standard regex \1 result etc.; and Value holds the found text, so a Match's Value is the same as Groups[0].Value.
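
A small sketch of those properties in action, using the .NET [regex]::Matches method directly on a made-up string (the text and pattern are invented for illustration). It also shows the same regex matching more than once, with each match handled in turn.

```powershell
$text  = 'ui_mode=3; ui_mode=17'                    # hypothetical input
$found = [regex]::Matches($text, 'ui_mode=(\d+)')   # every match of the same regex

$lines = foreach ($m in $found) {
    # Value and Groups[0].Value are the whole match; Groups[1].Value is what \1 refers to.
    '{0} | {1} | {2}' -f $m.Value, $m.Groups[0].Value, $m.Groups[1].Value
}
$lines
# ui_mode=3 | ui_mode=3 | 3
# ui_mode=17 | ui_mode=17 | 17
```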
« Last Edit: August 08, 2018, 01:53 PM by Ath »

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #57 on: August 08, 2018, 10:05 PM »
4wd did a fine job of explaining all the special chars / shortcuts ...

Hey now, that's going a bit far ...

@kalos:

Powershell Objects
The Complete Guide to Powershell Punctuation
Powershell Commands
Powershell Syntax

Every piece of script I've put in this thread resulted from one simple search - especially the last piece.

I search Google until I get close enough to the answer I want, it's how I learnt Powershell, (and I'm nowhere near being proficient in it).  If you're not getting what you want you're probably being too specific, change your search parameters, because it's a high probability that someone else has already asked it and most likely had it answered.

I suggest you read the link I posted earlier: How To Ask Questions The Smart Way

[Insert deity] knows I'm guilty of asking stupid questions but at least I try to provide all the information required/asked to help whoever it is that may decide to offer help, whether it be data, required output, the right question, cup of coffee, chocolate filled croissant, new car ... (wait, I draw the line there), etc.

And we still have no idea how your various bits of data file relate to each other.

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #58 on: August 09, 2018, 10:53 AM »
That's good info, so I will have to have a good read on these to be able to understand PS.

I understand better by studying examples, so I would appreciate your help with this.

PS: This is not related to the initial data file I wanted to process.

Any idea why this does not work?
Get-Content *.xml | Out-String

I want then to append to a file all the matches of a regex1 or regex2. Any idea?

Also, any idea on how to extract specific values from xml nodes?
I type  select-xml -path *.xml -xpath "/html:html/html:products/html:product/html:referenceData/html:pct/html:productId" and it doesn't work

Something else, I want to parse the text of a file and use -match on it, but I cannot figure out how to do it, it's so embarrassing!
gc *.xml -match *regex*
does not work :(
« Last Edit: August 09, 2018, 12:07 PM by kalos »

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #59 on: August 09, 2018, 07:59 PM »
Any idea why this does not work?
Get-Content *.xml | Out-String

I have a better idea, you tell us why you think it doesn't work.

gc *.xml -match *regex*
does not work :(

Code: PowerShell
Get-Help Get-Content

You tell us why it doesn't work.
« Last Edit: August 09, 2018, 08:57 PM by 4wd »

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #60 on: August 10, 2018, 05:32 AM »
Any idea why this does not work?
Get-Content *.xml | Out-String

I have a better idea, you tell us why you think it doesn't work.

gc *.xml -match *regex*
does not work :(

Code: PowerShell
Get-Help Get-Content

You tell us why it doesn't work.

Mmm I don't know why Get-Content *.xml | Out-String does not work to be honest.
I read at https://ss64.com/ps/out-string.html:
Send the content of Test1.txt to the console as a single string:
PS C:\> get-content C:\docs\test1.txt | out-string
So, shouldn't it work?

As for gc *.xml -match *regex* it seems that -match does not go after gc, but how do I connect/pipe these?
I had no clue, but I found in an irrelevant place on the internet this:
(Get-Content .\input.txt) -join "`r`n"
How should I know that I need to parenthesise the first object? Where does it say that in PS manual?
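
A sketch of what seems to be going on here, assuming a made-up sample file: Get-Content emits one string per line, and without parentheses -match is read as a parameter name of Get-Content rather than as an operator. Wrapping the command in parentheses makes it finish first, so the operator then works on its output. The file name and contents below are hypothetical.

```powershell
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-demo.txt'
Set-Content -Path $file -Value 'alpha 1', 'beta', 'alpha 2'

# Parentheses make Get-Content finish first; -match then filters the array of lines.
$lines = (Get-Content -Path $file) -match 'alpha'

# -Raw (PowerShell 3.0 and later) returns the whole file as one string instead of an array of lines.
$whole = Get-Content -Path $file -Raw

# (?s) lets . match newlines too, so a single match can span several lines.
$spans = $whole -match '(?s)alpha 1.*alpha 2'

$lines   # the two lines containing 'alpha'
$spans   # True
```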
« Last Edit: August 10, 2018, 07:37 AM by kalos »

wraith808

  • Supporting Member
  • Joined in 2006
  • Posts: 11,186
Re: Extract REGEX matches from multiple text files
« Reply #61 on: August 10, 2018, 07:55 AM »
I think he was asking what you mean by "doesn't work". You just said it didn't work and didn't give any indication of what happened.

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #62 on: August 10, 2018, 08:38 AM »
They were to get a couple of points across, which were completely missed.

1. His guess as to why the first doesn't work will always be better than whatever we could come up with because we never, ever get enough information/context to make an informed guess/opinion.

2. He doesn't read what has already been given because the answer is in this thread.

PS. Sorry mouser ...  :-\

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #63 on: August 10, 2018, 09:43 AM »
2. He doesn't read what has already been given because the answer is in this thread.

PS. Sorry mouser ...  :-\

Oh sorry, but due to my learning difficulty I need to be pointed to the exact thing.
I am developing my PS understanding though and it seems very powerful  :Thmbsup:

I wrote this script:
(gc *.txt)  -replace "regex1(.+?)", "`$1" >> out

It replaces the regex with the reference from the regex and outputs to a new file. Not exactly what I want it to do.
I want to output the reference, any idea how to do that?
Also, I want to run several regex matches sequentially, each with its own references, and append each result to the output file.

I believe piping commands does not achieve this. I think piping is about getting the output object from the previous command and feeding it to the next command. However, I want the various regex matches to work on the original object sequentially. This is a bit tricky, any idea?


Also, can you tell me please how to find and select and append values from multiple xml nodes knowing their XPath?
I do that and it doesn't work:
Select-Xml -Path "*.xml" -XPath "/html:book/html:Entity" >> out
Also this doesn't work:
PS H:\> [xml]$Types = get-content *.xml
PS H:\> select-xml -xml $Types -xpath "//html:Entity"
select-xml : Namespace Manager or XsltContext needed. This query

Thanks!
« Last Edit: August 10, 2018, 11:26 AM by kalos »

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #64 on: August 10, 2018, 01:37 PM »
Also, I want to run several regex matches sequentially, each with its own references, and append each result to the output file.
You have to make clear whether the results from the separate queries have any positional relation to each other, or can the queries be run one after the other and the output of the second, third, etc., runs appended to the first regex run?
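
If the queries really are independent of each other, running them one after the other could look roughly like this. It is a sketch under that assumption only: the text and both patterns are invented stand-ins for regex1, regex2 and so on, and each pattern is applied to the same original string rather than to the previous pattern's output.

```powershell
$text     = "id=10004 date=2018-07-23`nid=10005 date=2018-07-24"   # hypothetical input
$patterns = 'id=(\d+)', 'date=([\d-]+)'                             # stand-ins for regex1, regex2

$results = foreach ($p in $patterns) {
    # Every pattern scans the original text; its matches are appended after the previous pattern's.
    foreach ($m in [regex]::Matches($text, $p)) { $m.Groups[1].Value }
}
$results
```

The results come out grouped per pattern (all ids, then all dates), which is exactly why the positional relation between the queries matters.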

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #65 on: August 10, 2018, 08:21 PM »
Also, I want to run several regex matches sequentially, each with its own references, and append each result to the output file.
You have to make clear whether the results from the separate queries have any positional relation to each other, or can the queries be run one after the other and the output of the second, third, etc., runs appended to the first regex run?

Or to put it another way:
  • STOP trying to describe what you want to happen, (because you're not very good at it).
  • Provide sufficient sample input data of any kind whether it's real or made up (as long as it represents the real format).
  • Provide an example of the output using the input data that shows what you're trying to achieve.
  • PROVIDE relevant feedback, something you consistently fail to do, (eg. 1, 2, 3, 4, etc, etc, etc).

DO NOT give us separate examples of two disparate data types without showing how they relate to one another, (eg. XML <> and CDATA []), within the same file.

Until you can do that we're just running around in circles and it's pointless continuing this thread, as such I'm out of here until the above happens.
« Last Edit: August 11, 2018, 06:28 AM by 4wd »

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #66 on: August 13, 2018, 05:03 AM »
OK, so the input is:
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:Entity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>

The above continues to prod2 etc.

The output of the data would be:
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
Then a new line would start with:
prod2; etc


However, I want to convert the input data into a string, because I may need to match longer substrings than e.g. "<html:classificationType>(.+?)</html:classificationType>"
Also, I think there may be duplicates for each prod, e.g. more than one assignedDate node with different values, so MatchAll would be best.
thanks!
« Last Edit: August 13, 2018, 05:25 AM by kalos »

wraith808

  • Supporting Member
  • Joined in 2006
  • Posts: 11,186
Re: Extract REGEX matches from multiple text files
« Reply #67 on: August 13, 2018, 07:18 AM »
So it's always xml and it's always that schema?  And you're just worried about duplicates?

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #68 on: August 13, 2018, 08:45 AM »
So it's always xml and it's always that schema?  And you're just worried about duplicates?


Yeah, for now it looks like that.

wraith808

  • Supporting Member
  • Joined in 2006
  • Posts: 11,186
Re: Extract REGEX matches from multiple text files
« Reply #69 on: August 13, 2018, 10:05 AM »
And one last question... when you say duplicate, you mean the whole record is duplicated?  Or just some of the fields, i.e. productID or prod id?

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #70 on: August 13, 2018, 10:47 AM »
And one last question... when you say duplicate, you mean the whole record is duplicated?  Or just some of the fields, i.e. productID or prod id?

Some fields, eg there may be more than one assignedDate value, so the script will need to process these additional fields for the same prod.

The pseudocode I am looking for is like this:
1) search for the first 'prod' section of the file, convert it to a single line, extract the appropriate regex (all matches) one after the other (that's why I want to specify all the regex matches that I want the script to search for when scanning the line, as I am not sure which order they will be in; it shouldn't change, but just in case)
2) then find the next 'prod' section in the file, convert it to single line and put it in a line below the previous, then extract the regexes one by one

Any hint?

I tried to use | to add OR regex matches, but I think it didn't work.
« Last Edit: August 13, 2018, 11:39 AM by kalos »

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #71 on: August 13, 2018, 02:05 PM »
OK, so the input is:
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:Entity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>

The above continues to prod2 etc.

The output of the data would be:
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
Then a new line would start with:
prod2; etc


That finally makes some sense. Here is an example solution for putting that into a .csv formatted file.
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)

  But, earlier in this thread you wrote this:
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>
Well guys, the data is what I posted in my last post (Plants),
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?


However, I want to convert the input data into a string, because I may need to match longer substrings than e.g. "<html:classificationType>(.+?)</html:classificationType>"
You are talking b.s. here.


Also, I think there may be duplicates for each prod, e.g. more than one assignedDate node with different values, so MatchAll would be best.
This doesn't make sense without an example, and MatchAll is inappropriate here.


extract the appropriate regex
PLEASE STOP TELLING US HOW TO SOLVE YOUR CHALLENGE!
(This could have been bigger and in red, but I'm trying to stay nice, so I didn't)
If you want to learn regex, go get a book or on-line course, there are plenty here and here, and stop feeding us xml.

When handling XML, no regexes are usually involved, unless the data elements contain 'complex', somewhat structured, data that needs to be broken down.
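
A sketch of the no-regex route, using the sample record from this thread. Two things are assumptions made for illustration: the namespace URI http://example.com/ns (the real declaration was never posted) and the corrected closing tag on the Entity element. The -Namespace hashtable is what the "Namespace Manager or XsltContext needed" error is asking for.

```powershell
# Sample record from this thread; the xmlns:html URI is made up, the real one was never given.
$xmlText = @'
<html:products xmlns:html="http://example.com/ns">
  <html:prod id="prod1">
    <html:referenceData>
      <html:product>
        <html:classificationType>PRD</html:classificationType>
        <html:productType>PRD_XE</html:productType>
        <html:productId>10004</html:productId>
        <html:assignedDate>2018-07-23</html:assignedDate>
      </html:product>
      <html:book>
        <html:name>REPAIRS</html:name>
        <html:Entity>REP_XE</html:Entity>
        <html:location>ED</html:location>
      </html:book>
    </html:referenceData>
  </html:prod>
</html:products>
'@

# Maps the prefix used in the XPath to the (assumed) namespace URI.
$ns = @{ html = 'http://example.com/ns' }

$rows = Select-Xml -Content $xmlText -XPath '//html:prod' -Namespace $ns | ForEach-Object {
    $prod = $_.Node
    # Every element without child elements under this prod, in document order (repeated fields included).
    $values = Select-Xml -Xml $prod -XPath './/*[not(*)]' | ForEach-Object { $_.Node.InnerText }
    (@($prod.GetAttribute('id')) + $values) -join '; '
}
$rows
# prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
```

Because the inner query takes every leaf element in document order, a second assignedDate under the same prod would simply show up as one more field on that line.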

I have this assignment for you:
- read the entire thread from OP to the end and formulate an answer to all unanswered questions we asked you. (Just quote the question and type the answer below the quote)
After all the answers are given you can ask 1 new question. As 4wd already stated, and you said yourself but in other words, you aren't good at answering questions, but it is required for other people to help you solve your challenge/quest.

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,641
Re: Extract REGEX matches from multiple text files
« Reply #72 on: August 13, 2018, 07:41 PM »
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)

Or by installing an updated XML module which will give you Remove-XmlNamespace

I'm still not convinced it's sufficient information: I would have thought there should be a Namespace declaration somewhere, and he's only giving the information for one record within the file.

Well guys, the data is what I posted in my last post (Plants),
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?

PS: This is not related to the initial data file I wanted to process.

So far:
  • XML,
  • XML with Namespace, (file sizes unknown),
  • CDATA

... and counting ...

It's another problem, jumping from one thing to another without getting any one thing completed.
« Last Edit: August 13, 2018, 08:21 PM by 4wd »

kalos

  • Member
  • Joined in 2006
  • Posts: 1,823
Re: Extract REGEX matches from multiple text files
« Reply #73 on: August 20, 2018, 08:37 AM »
I don't understand why you do not answer my specific questions, regardless of the source data format and the desired output. Is what I am asking not possible with Powershell?

For example, I want to perform a regex match that will output all matches of regex1 and regex2 and regex3.

How can I do that?

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,612
Re: Extract REGEX matches from multiple text files
« Reply #74 on: August 20, 2018, 08:59 AM »
How can I do that?
That's why we asked more specific questions, but you never answered them.
So then I gave you the assignment of answering all our unanswered questions, but you haven't done that up until now, so basically, we are waiting (but not holding our breath) for your answers, before accepting new questions. :(