Topic: Extract REGEX matches from multiple text files (Read 61178 times)

kalos

  • Member
  • Joined in 2006
  • Posts: 1,824
Re: Extract REGEX matches from multiple text files
« Reply #25 on: August 05, 2018, 01:44 PM »
Why is it useless? It's an exact representation, apart from the fact that there is more irrelevant text around it.

kalos

Re: Extract REGEX matches from multiple text files
« Reply #26 on: August 05, 2018, 01:55 PM »
You can check this as well:


,["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",1234125123l,[null,null,"inline"]
]
,["num_cols",14351435,[null,null,null,2.0]
]
,["max_timing",235123512,[null,null,null,2500.0]
]
,["check_parent_card",143512122,[null,null,null,null,1]
]
,["counterfactual_logging",213513212412,[null,null,null,null,0]
]
]
]
,["t--ddbPTeIsNI","iLS0pb0OlVDE","r-iLS0pb0OlVDE",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",4311231235,[null,null,"inline"]
]
,["num_cols",12341241234,[null,null,null,2.0]
]
,["max_timing",23512351223,[null,null,null,2500.0]
]
,["check_parent_card",5235123412,[null,null,null,null,1]
]
,["counterfactual_logging",12351251212,[null,null,null,null,0]
]
]
]
,["t--ddbPTeIsNI","ibE7thiz85_Y","r-ibE7thiz85_Y",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",124351235,[null,null,"inline"]
]
,["num_cols",623423451,[null,null,null,2.0]
]
,["max_timing",123512351,[null,null,null,2500.0]
]
,["check_parent_card",1235125123,[null,null,null,null,1]
]
,["counterfactual_logging",12351235145,[null,null,null,null,0]
]
]
]

Let's say I want to extract the numbers in the fields ui_mode etc. for each of these three separate records.

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,629
Re: Extract REGEX matches from multiple text files
« Reply #27 on: August 05, 2018, 03:24 PM »
Now we are getting somewhere, sort of.

You just haven't told us which other parts of the data you need extracted from each record, besides the ui_mode field, and which identifying field should go in the first column of the CSV output you suggested earlier.

kalos

Re: Extract REGEX matches from multiple text files
« Reply #28 on: August 05, 2018, 03:48 PM »
Quote from: Ath
Now we are getting somewhere, sort of.

You only didn't tell what other parts of the data you need extracted from each record, besides the ui_mode field, and what the identifying field is that should go in the first column of the csv output you suggested earlier.


Is your second line a question?

Ath

Re: Extract REGEX matches from multiple text files
« Reply #29 on: August 05, 2018, 03:51 PM »
Quote from: kalos
Quote from: Ath
Now we are getting somewhere, sort of.

You only didn't tell what other parts of the data you need extracted from each record, besides the ui_mode field, and what the identifying field is that should go in the first column of the csv output you suggested earlier.

Is your second line a question?

Yes, 2 actually.

kalos

Re: Extract REGEX matches from multiple text files
« Reply #30 on: August 05, 2018, 03:54 PM »
Quote from: Ath
Now we are getting somewhere, sort of.

You only didn't tell what other parts of the data you need extracted from each record, besides the ui_mode field, and what the identifying field is that should go in the first column of the csv output you suggested earlier.

We can extract ui_mode and max_timing, and the first column would be the second quoted string, i.e. iGTzEhwhMx4U for the first record.
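
For the requirement as stated here (ui_mode and max_timing per record, keyed on the second quoted string), a minimal PowerShell sketch could look like the following. The sample text is a shortened, hypothetical copy of the records posted in reply #26, the property names Id, UiMode and MaxTiming are made up for illustration, and the sketch assumes that every record starts with three quoted strings of which the third begins with r-.

```powershell
# Hypothetical sample: two shortened records in the shape posted in reply #26.
$text = @'
,["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",1234125123,[null,null,"inline"]
]
,["max_timing",235123512,[null,null,null,2500.0]
]
]
]
,["t--ddbPTeIsNI","iLS0pb0OlVDE","r-iLS0pb0OlVDE",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",4311231235,[null,null,"inline"]
]
,["max_timing",23512351223,[null,null,null,2500.0]
]
]
]
'@

# Assumption: a record starts at ,["<tag>","<id>","r-...  Split just before each start.
$recordStart = '(?=,\["[^"]+","[^"]+","r-)'

$rows = foreach ($record in ($text -split $recordStart)) {
    # Skip any chunk that does not begin with a record header.
    if ($record -match ',\["[^"]+","(?<id>[^"]+)","r-') {
        $id = $Matches['id']
        # Each field is looked up only inside the current record.
        $ui = if ($record -match '\["ui_mode",(?<n>\d+)') { $Matches['n'] }
        $mt = if ($record -match '\["max_timing",(?<n>\d+)') { $Matches['n'] }
        [pscustomobject]@{ Id = $id; UiMode = $ui; MaxTiming = $mt }
    }
}

# One CSV line per record, with the id in the first column.
$rows | ConvertTo-Csv -NoTypeInformation
```

Splitting into records first keeps each field lookup local to one record, so a missing field in one record cannot accidentally pick up the value from the next one.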

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,644
Re: Extract REGEX matches from multiple text files
« Reply #31 on: August 05, 2018, 08:10 PM »
It looks suspiciously like JSON data which PowerShell can handle without using RegEx too much. My bad, wrong type of brackets.

What's the first 10-20 lines of the file?
And the last 20 or so, that'll give us enough, (in theory), to create a small test file.

If it were XML, you might be able to just use the Select-Xml cmdlet.

Quote from: kalos
Why is it useless? It's exact representation apart from the fact that are more irrelevant text around.

No, it's your interpretation not the exact data, (raw data), which would show us the structure.
« Last Edit: August 06, 2018, 03:14 AM by 4wd »

kalos

Re: Extract REGEX matches from multiple text files
« Reply #32 on: August 06, 2018, 03:26 AM »
The first lines are:

<?xml version="1.0" encoding="ISO8859-1" ?>
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>
 <PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
 </PLANT>

Now this goes on and on and the last lines are:
<PLANT>
 <COMMON>Cardinal Flower</COMMON>
 <BOTANICAL>Lobelia cardinalis</BOTANICAL>
 <ZONE>2</ZONE>
 <LIGHT>Shade</LIGHT>
 <PRICE>$3.02</PRICE>
 <AVAILABILITY>022299</AVAILABILITY>
 </PLANT>
</CATALOG>

But I do not want to work on it with Select-XML because it will limit my learning a lot. Instead I want to use REGEX so that I can learn something that can be applied to many other situations.
I believe I need to learn, in PowerShell:
1) how to read a file
2) how to search for a regex match, store it in a variable, then run another regex search on that variable and return part of the match or append it to an output file
3) how to search for the next match of the regex and loop the above
4) all regexes must be multiline
« Last Edit: August 06, 2018, 05:21 AM by kalos »
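
The four points in this list can be sketched in a few lines of PowerShell. This is only meant to show the regex mechanics that were asked about, not to suggest regex is the right tool for XML. The file names are hypothetical, the sample is a cut-down version of the PLANT records above, and reading with -Raw assumes the file fits in memory.

```powershell
# Hypothetical file names; the sample mirrors the PLANT records posted above.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'plants-sample.xml'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'plants-out.txt'
Set-Content -Path $inFile -Value @'
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <PRICE>$2.44</PRICE>
 </PLANT>
 <PLANT>
 <COMMON>Columbine</COMMON>
 <PRICE>$9.37</PRICE>
 </PLANT>
</CATALOG>
'@
if (Test-Path $outFile) { Remove-Item $outFile }

# 1) Read the whole file as ONE string so a pattern can span line breaks.
$text = Get-Content -Path $inFile -Raw

# 2) Outer regex: (?s) lets "." match newlines; ".*?" is lazy, so each
#    PLANT block becomes its own match instead of one giant match.
$blocks = [regex]::Matches($text, '(?s)<PLANT>.*?</PLANT>')

# 3) Loop over the matches and run a second regex inside each matched block.
foreach ($block in $blocks) {
    $inner = [regex]::Match($block.Value, '<COMMON>(?<name>[^<]*)</COMMON>')
    if ($inner.Success) {
        # 4) Append only the captured part to the output file.
        Add-Content -Path $outFile -Value $inner.Groups['name'].Value
    }
}

Get-Content $outFile
```

Get-Content without -Raw returns one string per line, which is why patterns that need to cross line breaks do not match in that mode.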

Ath

Re: Extract REGEX matches from multiple text files
« Reply #33 on: August 06, 2018, 05:23 AM »
,["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",1234125123l,[null,null,"inline"]
]
,["num_cols",14351435,[null,null,null,2.0]
]
,["max_timing",235123512,[null,null,null,2500.0]
]
,["check_parent_card",143512122,[null,null,null,null,1]
]
,["counterfactual_logging",213513212412,[null,null,null,null,0]
]
]
]

<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>


A couple of questions:
  • In what way do these totally different types/forms of data relate to each other?
  • Are they from different files?
    • Is the first file some sort of definition file and the second the actual data?
    • Is the first part embedded in a CDATA tag like this: "<![CDATA[ your non-xml-formed data like in the first quote goes here ]]>" ?
  • Please provide the exact filenames.

Ath

Re: Extract REGEX matches from multiple text files
« Reply #34 on: August 06, 2018, 05:31 AM »
Quote from: kalos
it will limit my learning a lot

Well, please first try to learn how to describe your challenge well, a tutorial was linked earlier by 4wd, then we will try to teach you how to best solve your challenge. It may not need regex at all.

A common saying about regexes goes like this: You try to solve a problem with a regex. Now you've got 2 problems...

4wd

Re: Extract REGEX matches from multiple text files
« Reply #35 on: August 06, 2018, 06:01 AM »
Quote from: kalos
But I do not want to work on it with Select-XML because it will limit my learning a lot.

If you are going to work with XML then learn to use the most efficient means available otherwise it isn't learning, (well obviously you'll learn from your mistakes but why be inefficient?).

The same applies to JSON, CSV, etc, etc - learning to use the wrong method to achieve what you want is what cripples you.  To put it simply: Use the right tool for the job.

From the Programmer Humor thread, a very eloquent StackOverflow answer that illustrates what Ath said about regexes above.

Do you have one of these files that is considerably less than 25GB, that contains no proprietary data, that you can 7-Zip, (being plain text with lots of repetition it should compress well), then upload to GDrive or somewhere, and then provide us with a link?
« Last Edit: August 06, 2018, 06:17 AM by 4wd »

kalos

Re: Extract REGEX matches from multiple text files
« Reply #36 on: August 06, 2018, 07:01 AM »
Mmm, I see.

Well guys, the data is what I posted in my last post (Plants), these are three sample records and they keep repeating (with different values).

How do I parse this in the easiest way?

kalos

Re: Extract REGEX matches from multiple text files
« Reply #37 on: August 06, 2018, 07:46 AM »
By the way, is there a way to do 'find next' in Powershell without having to find all matches and create an array? I imagine the latter is very RAM consuming.
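
One way to get a "find next" loop is the .NET Match object, which has a NextMatch() method. A minimal sketch, assuming the text is already in a string; the function name Get-NextMatches and the sample string are hypothetical.

```powershell
# Hypothetical sample text; in practice this would come from a file.
$text = '<ZONE>4</ZONE> <ZONE>3</ZONE> <ZONE>2</ZONE>'

function Get-NextMatches {
    param([string]$Text, [string]$Pattern)
    # Find the first match, then keep asking for the next one.
    $m = [regex]::Match($Text, $Pattern)
    while ($m.Success) {
        $m.Groups[1].Value      # emitted to the pipeline right away
        $m = $m.NextMatch()     # continue searching after the previous match
    }
}

# Each value is handed on as soon as it is found, without building an array first.
Get-NextMatches -Text $text -Pattern '<ZONE>(\d+)</ZONE>' |
    ForEach-Object { "zone $_" }
```

Note that this only avoids storing the matches; the input string itself still has to be in memory, so a very large file would need to be read in pieces.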

wraith808

  • Supporting Member
  • Joined in 2006
  • Posts: 11,190
Re: Extract REGEX matches from multiple text files
« Reply #38 on: August 06, 2018, 07:47 AM »
Quote from: Ath
Quote from: kalos
it will limit my learning a lot

Well, please first try to learn how to describe your challenge well, a tutorial was linked earlier by 4wd, then we will try to teach you how to best solve your challenge. It may not need regex at all.

A common saying about regexes goes like this: You try to solve a problem with a regex. Now you've got 2 problems...

Totally agree! One of our systems at work is based on regex, and it's very powerful, but very touchy to change!  ;D

If you're going to use regex, it's not strictly necessary to purchase a tool like RegexBuddy, but I'd recommend it to preserve your sanity!

kalos

Re: Extract REGEX matches from multiple text files
« Reply #39 on: August 06, 2018, 08:13 AM »
Yeah, I want to use RegEx to be honest. But I struggle to find a way to do it.

The first line of the data contains an ID. So I can store all these IDs in an array.
Then, for each entry in the array, I will be able to match some regex and output them.

The problem is that the data gets into so many deep tree branches that it gets hard to isolate them.

Mmmmm! Now I got an idea.
What if I could convert the XML file into a flat structured file, where each line displays the attribute name and value (as it normally does in XML), but also displays the attributes and values from the whole hierarchy above it?

That way, it will be much more manageable, because I will be able to isolate and process specific lines.

Any script that can do this?
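
A flattening script along those lines can be sketched with the built-in [xml] type and a small recursive function. The function name Write-FlatXml, the sample document and its id attribute are hypothetical, and the sketch assumes no element or attribute is itself called Name, Attributes or ChildNodes (PowerShell would then resolve those property names to the XML content instead).

```powershell
# Hypothetical sample document; the id attribute is made up for illustration.
[xml]$doc = @'
<CATALOG>
 <PLANT id="p1">
  <COMMON>Bloodroot</COMMON>
  <ZONE>4</ZONE>
 </PLANT>
</CATALOG>
'@

function Write-FlatXml {
    param([System.Xml.XmlNode]$Node, [string]$Path = '')
    # Full path from the root down to this element.
    $here = "$Path/$($Node.Name)"
    # One line per attribute of this element.
    foreach ($attr in $Node.Attributes) {
        "$here/@$($attr.Name) = $($attr.Value)"
    }
    foreach ($child in $Node.ChildNodes) {
        if ($child.NodeType -eq [System.Xml.XmlNodeType]::Element) {
            # Recurse into child elements, carrying the path along.
            Write-FlatXml -Node $child -Path $here
        } elseif ($child.NodeType -in 'Text', 'CDATA') {
            # One line per text (or CDATA) value.
            "$here = $($child.Value)"
        }
    }
}

Write-FlatXml -Node $doc.DocumentElement
```

Every output line carries the names of all its ancestors, so individual lines can then be filtered with a simple pattern.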

kalos

Re: Extract REGEX matches from multiple text files
« Reply #40 on: August 06, 2018, 10:55 AM »
By the way, this script looks amazing (From: https://www.codeproj...0/powershell-and-xml):

PS C:\> $xml = (Get-Content file.xml)
PS C:\> $xml = [xml](Get-Content file.xml)
PS C:\> $xml.SelectNodes("/employees/employee")

id                                      name                                    age
--                                      ----                                    ---
101                                     Frankie Johnny                          36
102                                     Elvis Presley                           79
301                                     Ella Fitzgerald                         102

But I cannot make it work for my file. Any hint?

kalos

Re: Extract REGEX matches from multiple text files
« Reply #41 on: August 06, 2018, 11:36 AM »
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?


4wd

Re: Extract REGEX matches from multiple text files
« Reply #42 on: August 06, 2018, 11:37 AM »
Quote from: kalos
But I cannot make it work for my file. Any hint?

Yeah, as Ath suggested, your XML contains CDATA, so you have to read that separately.

https://stackoverflo...file-with-powershell

kalos

Re: Extract REGEX matches from multiple text files
« Reply #43 on: August 06, 2018, 12:50 PM »
Quote from: 4wd
Quote from: kalos
But I cannot make it work for my file. Any hint?

Yeah, as Ath suggested, your XML  contains CDATA so you have to read that separately.

https://stackoverflo...file-with-powershell

I will try but can you help me with the below:

Quote from: kalos
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?



Ath

Re: Extract REGEX matches from multiple text files
« Reply #44 on: August 06, 2018, 02:14 PM »
Quote from: kalos
If I could convert the XML file in a flat structured file

Converting your .xml to .csv is a fairly easy one-liner in PowerShell, assuming a single .xml file going into a single .csv file:
Code: PowerShell
$([xml]$(Get-Content .\kalos-data1.xml)).SelectNodes("/CATALOG/PLANT") | Export-Csv .\kalos-data1.csv
The reason it wasn't working for you was probably that you didn't account for XML being case-sensitive.

Parsing an .xml file containing CDATA tags with a regex is close to impossible to get right, as nearly any content is possible inside such a CDATA tag, including valid XML...


Quote from: kalos
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Please listen to people with more (programming) experience than you have, you are really trying to hammer round screws into square holes here, don't do that, you'll hurt yourself.

kalos

Re: Extract REGEX matches from multiple text files
« Reply #45 on: August 06, 2018, 02:25 PM »
I don't understand what CDATA is.

My xml file contains tons of tags, ie text inside <>, in a complex hierarchy.
Apart from that, it contains values both inside the <>, in the format <someTag someID="SomeValue">, and between tags, in the format <someTag>SomeValue</someTag>.

1) I don't know what the total number and hierarchy of tags is. So can I select ALL nodes under the whole hierarchy?
2) Will PS process both formats of values above?
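
Both questions can be tried out on a tiny document. In the sketch below the XML is a hypothetical sample built from the two formats just described; the XPath expressions //*, //@* and //text() are standard XPath and select every element, every attribute and every text node at any depth.

```powershell
# Hypothetical sample containing both value formats from the post above.
[xml]$doc = '<root><someTag someID="SomeValue">inner text</someTag><other><someTag>SomeValue</someTag></other></root>'

# 1) "//*" selects every element at any depth, without knowing the hierarchy.
$elements = $doc.SelectNodes('//*')
$elements | ForEach-Object { $_.Name }

# 2a) "//@*" selects every attribute, i.e. the <someTag someID="SomeValue"> format.
$attributes = $doc.SelectNodes('//@*')
$attributes | ForEach-Object { "$($_.OwnerElement.Name)/@$($_.Name) = $($_.Value)" }

# 2b) "//text()" selects every text node, i.e. the <someTag>SomeValue</someTag> format.
$texts = $doc.SelectNodes('//text()')
$texts | ForEach-Object { "$($_.ParentNode.Name) = $($_.Value)" }
```

So neither the depth of the hierarchy nor the mix of attribute values and element values needs to be known in advance.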

kalos

Re: Extract REGEX matches from multiple text files
« Reply #46 on: August 06, 2018, 02:27 PM »
Quote from: Ath
Quote from: kalos
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Please listen to people with more (programming) experience than you have, you are really trying to hammer round screws into square holes here, don't do that, you'll hurt yourself.

OK, but I would still be very interested to learn how to do the below:
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?

Ath

Re: Extract REGEX matches from multiple text files
« Reply #47 on: August 06, 2018, 02:45 PM »
Quote from: kalos
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?

Well, the trouble is you'll have to do it in some script or programming language, as regex is actually a selection mechanism using pattern matching ('regular expressions').
For such a task I'd advise to use sed, the Stream EDitor, originally from unix, but also available for Windows, that is built for jobs like this.
I've made a Sed-Tester tool for NANY a couple years back, find it from the link below this post and try it out, it includes sed.exe, and has a link to sed documentation in the gui.
You can also continue on the PS-script 4wd gave here earlier, but that doesn't go through the data sequentially in the way sed does.

kalos

Re: Extract REGEX matches from multiple text files
« Reply #48 on: August 06, 2018, 03:17 PM »
Quote from: Ath
Stream EDitor

Very interesting tool, thanks!

4wd

Re: Extract REGEX matches from multiple text files
« Reply #49 on: August 06, 2018, 04:41 PM »
Quote from: kalos
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that?

That is what the code I originally posted does.

Quote from: kalos
Last, how to find the next regex match in the file?

Add another Select-String line with the next RegEx.

I'm going to give up until we get at least sensible raw data and what the expected output should look like ...
« Last Edit: August 07, 2018, 12:32 AM by 4wd »