DonationCoder.com Forum

Main Area and Open Discussion => General Software Discussion => Topic started by: kalos on August 03, 2018, 04:14 AM

Title: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 04:14 AM
Hello!

Which tool/scripting language can I use to match REGEX strings and extract them to a new file or to just delete the non matching strings from multiple text files?

It would help to be easy to write as I cannot learn complicated syntax!

Also, ideally it should work via command line as I am talking about many many files which can be huge.

Thanks!
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 03, 2018, 05:49 AM
Code: PowerShell
$outfile = 'K:\output.txt'               # matches are appended to this file
$regex = '^Function.+'                   # the pattern to extract
$items = Get-ChildItem -Path *.ps1       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 07:04 AM
Thanks but I am not familiar with this language.

I think Autohotkey would be more appropriate for me as it has simple structure (what are the $i = 0; $i -lt $items.Count; $i++ they look like Aramaic to me :P)

Can you tell me the commands/structure in AHK to do this:

search for a regex1, extract regex2 from regex1 (ie append to a new file), and continuously loop until there is no other regex1 found

Also, how do I specify an exact string in regex? I want to specify the string <dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">
and I don't want to escape every single symbol etc.
Is there a way to search for an exact string literally?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 03, 2018, 07:14 AM
(what are the $i = 0; $i -lt $items.Count; $i++

While i < (number of items)
  blah blah blah
  i = i + 1
Wend

EDIT: It's actually a For ... Next loop but same result.

For i = 0 to (number of items - 1) step 1
  blah blah blah
Next
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 07:25 AM
Also, the file I want to manipulate is 25GB! Is there a strategy to handle this?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 03, 2018, 07:26 AM
Code: PowerShell
$outfile = 'K:\output.txt'
$regex = '<dsf:tsdfgd trsdfge=\"urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d\" id=\"OsdfgsdfD\">'
$items = Get-ChildItem -Path *.txt       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 10:52 AM
Could you tell me please in AHK? I am not familiar with that language, unless you can point me to the explanations of these commands?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 03, 2018, 12:16 PM
The script language used is Microsoft's PowerShell, the intended successor to cmd and its relatively poor batch (.bat/.cmd) scripting language. It comes installed as standard with Win10, Win8.1 and Win8, and can easily be installed on older Windows versions.

Copy the script to a file with a .ps1 extension, adjust the 1st line to your desired results file, adjust *.txt in the 3rd line to the extension of your data files, press the Start button and start typing powershell to find it, then run the script from the directory where your data files are.
Largish files are no issue for PowerShell: Select-String -Path reads each file line by line rather than loading it into memory all at once.
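On the large-file point, here is a minimal sketch of what "line by line" looks like when written out explicitly. It is not the script posted above, just the same idea using .NET's File.ReadLines, which hands over one line at a time. The file names and the id=\d+ pattern are made up for the demo.

```powershell
# Hypothetical demo files in the temp folder (assumption: any full path works the same way).
$inPath  = Join-Path ([System.IO.Path]::GetTempPath()) 'big-demo.txt'
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'big-demo-out.txt'
Set-Content -Path $inPath -Value 'id=1 keep', 'skip me', 'id=22 keep'

$regex  = [regex]'id=\d+'
$writer = New-Object System.IO.StreamWriter -ArgumentList $outPath
try {
    # ReadLines enumerates lazily, so only the current line is held in memory.
    foreach ($line in [System.IO.File]::ReadLines($inPath)) {
        foreach ($m in $regex.Matches($line)) {
            $writer.WriteLine($m.Value)
        }
    }
}
finally {
    $writer.Close()
}

$result = Get-Content $outPath
$result
Remove-Item $inPath, $outPath
```

The .NET file methods resolve relative paths against the process working directory, not the current PowerShell location, so full paths are the safer choice here.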
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 02:37 PM
Very interesting!

Do you know a good site that explains the structure of the script you posted and the definition/usage of the commands along with examples?

Also, does this script load the whole text of the file in memory to perform its operations? This will be a problem for a 25GB file
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 03, 2018, 05:15 PM
Can you explain please word by word this bit:

$items = Get-ChildItem -Path *.txt       # *.txt , *.foo , *.whatever
for ($i = 0; $i -lt $items.Count; $i++) {
  Select-String -Path $items[$i] -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } >> $outfile
}

Also, I need to append to the output file several regex matches/returns, how do I do that?
Also, if I specify a regex match, how do I specify what I want to be returned from this match?
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 03, 2018, 10:02 PM
It's pretty standard PowerShell, and there are a lot of tutorials online that will explain the commands to you.  Just as a start, # is a line comment, so everything after that on the line is just documentation.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 03, 2018, 10:11 PM
$items - an arbitrarily named variable
=        - assignment operator: stores the result of the command on its right in $items
Get-ChildItem (https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-childitem?view=powershell-6)

Thus $items now equals an array of files in the current folder that match *.txt
$items[0] = firstfile.txt
$items[1] = secondfile.txt
etc
etc
etc

$items.Count  - total number of matching files found

for(){}   - a for (https://ss64.com/ps/for.html) loop, $i is a variable that gets incremented by 1 every loop until the total number of matching files is reached

Thus loop through all the files in the array performing the following on every file:

Select-String -Path $items[$i] -Pattern $regex -AllMatches

Search each file for matching RegEx pattern, get all matches.

| % { $_.Matches } | % { $_.Value } >> $outfile

RegEx matches are piped into a ForEach-Object (https://ss64.com/ps/foreach-object.html) loop (% is the shorthand notation). For each regex match, pipe its value to the output file in append mode.
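The same pipeline can be written without the % shorthand, which may make the two stages easier to follow. This is only a sketch: the sample file and its contents are invented, and the result is collected in a variable instead of being appended to a file.

```powershell
# Hypothetical sample file so the sketch is self-contained.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'sls-demo.txt'
Set-Content -Path $sample -Value @(
    'Function Get-Foo { }'
    'some other line'
    'Function Set-Bar { }'
)

# Stage 1: Select-String emits one MatchInfo object per matching line.
# Stage 2: $_ is that MatchInfo; its Matches property holds the regex matches on the line.
# Stage 3: $_ is now a single regex match; Value is the text that matched.
$values = Select-String -Path $sample -Pattern '^Function\s+\S+' -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$values
Remove-Item $sample
```

$_ always means "the object currently coming down the pipeline", so it refers to a different kind of object in each ForEach-Object block.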

Don't actually need to escape the " in the RegEx either:
Code: PowerShell
$regex = '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">'
Will also work.

Same as the 6 lines above (https://www.donationcoder.com/forum/index.php?topic=45945.msg422126#msg422126) without assigned variables or a for loop:
Code: PowerShell
gci *.txt | % { sls $_.Name -Pattern '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">' -a | % { $_.Matches } | % { $_.Value } >> K:\out.txt }
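On the earlier question about searching for an exact string without escaping every symbol: a small sketch of two options that should do it, Select-String's -SimpleMatch switch and [regex]::Escape. The sample lines are invented, and the second literal is only there because it contains characters that really do change meaning in a regex.

```powershell
$literal = '<dsf:tsdfgd trsdfge="urn:x-ssdfgs-dfg-com:isdfgc/tg4r3e-i4d" id="OsdfgsdfD">'
$lines = @(
    'noise'
    "before $literal after"
    'a.b (c) [d]'
)

# Option 1: -SimpleMatch treats the pattern as plain text, not as a regex.
$hits = $lines | Select-String -Pattern $literal -SimpleMatch

# Option 2: let .NET do the escaping, then use the result as a normal regex.
$escaped  = [regex]::Escape('a.b (c) [d]')
$hits2    = $lines | Select-String -Pattern $escaped

# For comparison: unescaped, the parentheses and brackets are regex syntax,
# so the literal line is not found at all.
$unescaped = $lines | Select-String -Pattern 'a.b (c) [d]'

$hits.Line
$hits2.Line
```

Because Select-String only emits the lines that match, piping its Line values to a new file is also one way to drop every line that does not contain a given phrase.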
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 04, 2018, 07:03 AM
Also, I need to append to the output file several regex matches/returns, how do I do that?
Also, if I specify a regex match, how do I specify what I want to be returned from this match?
Why did you leave these rather important 'details' out in your original question?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 04, 2018, 07:37 AM
Why did you leave these rather important 'details' out in your original question?

Um .... because that's what he normally does ... it always takes at least a week, (sometimes longer or never), before all pertinent information (https://d.cxcore.net/Eric%20S%20Raymond/How%20To%20Ask%20Questions%20The%20Smart%20Way.pdf) is obtained ...
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 04, 2018, 09:59 AM
And you're a lot more patient than I, in regards to someone not wanting to do the due diligence after you've given them the solution.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 04, 2018, 11:54 AM
Yeah, kind of overstepped my G.A.S. limit but it provided a little mental exercise.

Normally would have left it at my first post but I was a little bored ...
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 04, 2018, 03:46 PM
Thanks but I struggle to follow. I find AHK much more straightforward. But how can I make it work with a 25GB file?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 04, 2018, 04:59 PM
In the meantime I will read https://www.itprotoday.com/management-mobility/dons-188-minute-powershell-crash-course-you-can-learn
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 05, 2018, 03:10 AM
In the meantime I will read https://www.itprotoday.com/management-mobility/dons-188-minute-powershell-crash-course-you-can-learn
I expect you comprehend what's written there, it seems quite suited for powershell n00bs.

Thanks but I struggle to follow. I find AHK much more straight forward. But how can I make it work with a 25GB?
Please stop asking for an AHK solution; those who have participated here so far aren't going to provide one, as a perfectly good solution has already been provided.

If you had tried the script on actual data, you wouldn't have asked again about the 'measly' 25 GB file; yes, of course it will take some time to process, but so does an 18.8 minute powershell crash course.
Powershell is built on the foundation of .NET, so it knows how to handle files efficiently.

Why did you leave these rather important 'details' out in your original question?

Um .... because that's what he normally does ... it always takes at least a week, (sometimes longer or never), before all pertinent information (https://d.cxcore.net/Eric%20S%20Raymond/How%20To%20Ask%20Questions%20The%20Smart%20Way.pdf) is obtained ...
I know, I know, I'm just trying to educate someone (again, but it doesn't seem to be picked up much), see my quote below...
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 05:46 AM
That is very helpful thanks!

From what I have understood, the script will first scan its own folder where it exists, for all the txt files present and process them one by one in an array. Actually I think I can skip that bit if it can process the whole 25GB txt file at once.

As for the actual regex matches, what I would actually like it to do is to:
- scan the source file for a regex(A)
- finding the first instance of regex(A), it would store it in a variable and search another regex(B) inside that variable.
- then I have a couple more regex matches that I need it to store in that variable and output specific things from these regex matches inside the initial regex(A). By output I mean write sequentially line by line in an output file.
- then the loop will continue with the next regex(A) match inside the source file, and store it in a variable, and search for the same regex(B) etc matches inside that variable and output parts of those regex matches in the output file.

Sounds very basic and simple. Can you tell me what commands I need to write something like that please?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 05, 2018, 06:20 AM
If you are searching for a regex within a regex, 'You Are Doing It Wrong' (T).

Your initial requirement was to find and extract content using a regex, but now you need parts of that match to be split out? That can be done using a single regex, grouping the stuff you need to split out.
And for this whole exercise to make any sense, where is the variable part of the data to find? When searching for explicit text(s), a count would suffice...
Please provide a complete example, with actual data (not an entire file!), clearly marking the stuff you need to extract, of what you want to achieve, not how you think it could/should be solved.
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 05, 2018, 09:35 AM
Sounds like the same problem that I face at work.  Except I get paid to deal with the frustration.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 11:03 AM
Indeed, I now realised it!
I will try to provide an example in a bit.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 11:09 AM
The format of the data is like that (the only difference is that the data is multiline rather than single line as in this example):

prod1
blah
specs=a
blah
price=b
blah
prod2
blah
specs=c
blah
price=d
blah

So I want the output to be a csv like:
prod1; a; b
prod2; c; d

So I was thinking first a regex to highlight/save in a variable the first area of the text that belongs to a prod, which is the first six lines (I cannot use the number of lines to distinguish them as they vary).
Then it would extract a and b from that variable by matching the specs and price regex 'within' prod1 variable, so that I can distinguish them from prod2.
And then loop to complete the conversion.

Hope this helps?

So my understanding is that I cannot search for a regex that will match "specs=.+?" or something because I won't be able to distinguish this for prod1, prod2, etc.
At the same time, I cannot match the regex "prod1.+specs=.+?" because I don't know the exact text for prod1 (it's an xml attribute that is called prodID, but the value can be anything).

Do you have any idea on how to process this?
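For the simplified prod/specs/price layout above, a single regex with named groups can pull all three values per record, so no regex-inside-a-regex is needed. A minimal sketch, assuming every record starts with a line beginning with "prod" and always contains both a specs= and a price= line. The pattern and variable names are made up for the illustration.

```powershell
# The simplified sample from the post, joined into one multi-line string.
$text = @(
    'prod1', 'blah', 'specs=a', 'blah', 'price=b', 'blah',
    'prod2', 'blah', 'specs=c', 'blah', 'price=d', 'blah'
) -join "`n"

# (?m) lets ^ match at the start of every line, (?s) lets . cross line breaks.
# .*? is lazy, so each group stops at the first specs= / price= after the id.
$pattern = '(?ms)^(?<id>prod\w+).*?^specs=(?<specs>\S+).*?^price=(?<price>\S+)'

$rows = foreach ($m in [regex]::Matches($text, $pattern)) {
    '{0}; {1}; {2}' -f $m.Groups['id'].Value, $m.Groups['specs'].Value, $m.Groups['price'].Value
}

$rows
```

One caveat with this approach: if a record is missing its specs= or price= line, the lazy .*? will run on into the next record, so it only suits data where those fields are always present. It also needs the text in memory, which rules it out for a whole 25GB file in one go.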
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 05, 2018, 11:58 AM
This example is totally useless. >:(

Please extract 2 or 3 of those (complete) product records from your actual data file. Optionally replace confidential stuff (data, prices)with aaaaa, bbbbb, 1.23, etc., but leave the structure exactly as it is!
Then post that here.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 01:44 PM
Why is it useless? It's an exact representation, apart from the fact that there is more irrelevant text around it.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 01:55 PM
You can check this as well:


,["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",1234125123l,[null,null,"inline"]
]
,["num_cols",14351435,[null,null,null,2.0]
]
,["max_timing",235123512,[null,null,null,2500.0]
]
,["check_parent_card",143512122,[null,null,null,null,1]
]
,["counterfactual_logging",213513212412,[null,null,null,null,0]
]
]
]
,["t--ddbPTeIsNI","iLS0pb0OlVDE","r-iLS0pb0OlVDE",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",4311231235,[null,null,"inline"]
]
,["num_cols",12341241234,[null,null,null,2.0]
]
,["max_timing",23512351223,[null,null,null,2500.0]
]
,["check_parent_card",5235123412,[null,null,null,null,1]
]
,["counterfactual_logging",12351251212,[null,null,null,null,0]
]
]
]
,["t--ddbPTeIsNI","ibE7thiz85_Y","r-ibE7thiz85_Y",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",124351235,[null,null,"inline"]
]
,["num_cols",623423451,[null,null,null,2.0]
]
,["max_timing",123512351,[null,null,null,2500.0]
]
,["check_parent_card",1235125123,[null,null,null,null,1]
]
,["counterfactual_logging",12351235145,[null,null,null,null,0]
]
]
]

Let's say I want to extract the numbers in the fields ui_mode etc or each of these three separate records.
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 05, 2018, 03:24 PM
Now we are getting somewhere, sort of.

You just didn't say which other parts of the data you need extracted from each record, besides the ui_mode field, and which identifying field should go in the first column of the csv output you suggested earlier.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 03:48 PM
Now we are getting somewhere, sort of.

You only didn't tell what other parts of the data you need extracted from each record, besides the ui_mode field, and what the identifying field is that should go in the first column of the csv output you suggested earlier.


Is your second line a question?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 05, 2018, 03:51 PM
Yes, 2 actually.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 05, 2018, 03:54 PM
We can extract the ui_mode and max_timing, and the first column would be the second text in "", i.e. for the first record iGTzEhwhMx4U
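With the fields now named, one way to handle this bracketed sample is to go through it line by line and remember the current record's id until its values turn up. The sketch below uses PowerShell's switch -Regex for that. The patterns are guesses based only on the two trimmed records shown here, and the "id; ui_mode; max_timing" output layout is an assumption.

```powershell
# Two records trimmed from the sample posted above.
$lines = @(
    ',["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]'
    ']'
    ',["ui_mode",1234125123l,[null,null,"inline"]'
    ']'
    ',["max_timing",235123512,[null,null,null,2500.0]'
    ']'
    ',["t--ddbPTeIsNI","iLS0pb0OlVDE","r-iLS0pb0OlVDE",[["debug",null,null,null,null,[null,null,null,null,0]'
    ']'
    ',["ui_mode",4311231235,[null,null,"inline"]'
    ']'
    ',["max_timing",23512351223,[null,null,null,2500.0]'
    ']'
)

$id = $null
$ui = $null
# switch tests every line against each pattern; $Matches[1] is the first capture group.
$rows = switch -Regex ($lines) {
    '^,\["t--[^"]*","([^"]+)"' { $id = $Matches[1]; $ui = $null; continue }   # new record: remember its id
    '^,\["ui_mode",(\d+)'      { $ui = $Matches[1]; continue }
    '^,\["max_timing",(\d+)'   { "$id; $ui; $($Matches[1])" }                 # last wanted field: emit the row
}

$rows
```

For a real file, switch -Regex -File followed by the path reads the file one line at a time instead of taking an array, which keeps memory use flat regardless of file size.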
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 05, 2018, 08:10 PM
It looks suspiciously like JSON data which PowerShell can handle without using RegEx too much. My bad, wrong type of brackets.

What's the first 10-20 lines of the file?
And the last 20 or so, that'll give us enough, (in theory), to create a small test file.

If it were XML, you might be able to just use the Select-Xml (https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/select-xml?view=powershell-6) cmdlet.

Why is it useless? It's exact representation apart from the fact that are more irrelevant text around.

No, it's your interpretation not the exact data, (raw data), which would show us the structure.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 03:26 AM
The first lines are:

<?xml version="1.0" encoding="ISO8859-1" ?>
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>
 <PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
 </PLANT>

Now this goes on and on and the last lines are:
<PLANT>
 <COMMON>Cardinal Flower</COMMON>
 <BOTANICAL>Lobelia cardinalis</BOTANICAL>
 <ZONE>2</ZONE>
 <LIGHT>Shade</LIGHT>
 <PRICE>$3.02</PRICE>
 <AVAILABILITY>022299</AVAILABILITY>
 </PLANT>
</CATALOG>

But I do not want to work it with Select-XML because it will limit my learning a lot. Instead I want to use REGEX so that I can learn something that can be applied to many other situations.
I believe I need to learn in PowerShell:
1) how to read file
2) how to search for a regex, store it in a variable then perform another regex search in that variable and return a part of the match or append it in an output file
3) how to search for the next instance of the regex and loop the above
4) all regexes must be multiline
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 06, 2018, 05:23 AM
,["t--ddbPTeIsNI","iGTzEhwhMx4U","r-iGTzEhwhMx4U",[["debug",null,null,null,null,[null,null,null,null,0]
]
,["ui_mode",1234125123l,[null,null,"inline"]
]
,["num_cols",14351435,[null,null,null,2.0]
]
,["max_timing",235123512,[null,null,null,2500.0]
]
,["check_parent_card",143512122,[null,null,null,null,1]
]
,["counterfactual_logging",213513212412,[null,null,null,null,0]
]
]
]

<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>


A couple of questions:
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 06, 2018, 05:31 AM
it will limit my learning a lot
Well, please first try to learn how to describe your challenge well, a tutorial was linked earlier by 4wd, then we will try to teach you how to best solve your challenge. It may not need regex at all.

A common saying about regexes goes like this: You try to solve a problem with a regex. Now you've got 2 problems...
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 06, 2018, 06:01 AM
But I do not want to work it with Select-XML because it will limit my learning a lot.

If you are going to work with XML then learn to use the most efficient means available otherwise it isn't learning, (well obviously you'll learn from your mistakes but why be inefficient?).

The same applies to JSON, CSV, etc, etc - learning to use the wrong method to achieve what you want is what cripples you.  To put it simply: Use the right tool for the job.

From the Programmer Humor thread, a very eloquent StackOverflow answer (https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) that illustrates what Ath said about regexes above.

Do you have one of these files that is considerably less than 25GB, that contains no proprietary data, that you can 7-Zip, (being plain text with lots of repetition it should compress well), then upload to GDrive or somewhere, and then provide us with a link?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 07:01 AM
Mmm, I see.

Well guys, the data is what I posted in my last post (Plants), these are three sample records and they keep repeating (with different values).

How do I parse this in the most easy way?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 07:46 AM
By the way, is there a way to do 'find next' in Powershell without having to find all matches and create an array? I imagine the latter is very RAM consuming.
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 06, 2018, 07:47 AM
it will limit my learning a lot
Well, please first try to learn how to describe your challenge well, a tutorial was linked earlier by 4wd, then we will try to teach you how to best solve your challenge. It may not need regex at all.

A common saying about regexes goes like this: You try to solve a problem with a regex. Now you've got 2 problems...

Totally agree! One of our systems at work is based on Regex, and it's very powerful, but very touchy to change!  ;D

If you're going to use regex, though it's not necessary to purchase a tool like regexbuddy, I'd recommend it to preserve your sanity!
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 08:13 AM
Yeah, I want to use RegEx to be honest. But I struggle to find a way to do it.

The first line of the data contains an ID. So I can store all these IDs in an array.
Then, for each entry in the array, I will be able to match some regex and output them.

The problem is that the data gets into so many deep tree branches that it gets hard to isolate them.

Mmmmm! Now I got an idea.
If I could convert the XML file in a flat structured file, where each line will display the attribute name and value (as it normally does in XML), but it will also display the attributes and values from all the above hierarchy!

That way, it will be much more manageable, because I will be able to isolate and process specific lines.

Any script that can do this?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 10:55 AM
By the way, this script looks amazing (From: https://www.codeproject.com/articles/61900/powershell-and-xml):

PS C:\> $xml = (Get-Content file.xml)
PS C:\> $xml = [xml](Get-Content file.xml)
PS C:\> $xml.SelectNodes("/employees/employee")

id                                      name                                    age
--                                      ----                                    ---
101                                     Frankie Johnny                          36
102                                     Elvis Presley                           79
301                                     Ella Fitzgerald                         102

But I cannot make it work for my file. Any hint?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 11:36 AM
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?

Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 06, 2018, 11:37 AM
But I cannot make it work for my file. Any hint?

Yeah, as Ath suggested, your XML  contains CDATA so you have to read that separately.

https://stackoverflow.com/questions/1274070/how-to-read-cdata-in-xml-file-with-powershell
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 12:50 PM
But I cannot make it work for my file. Any hint?

Yeah, as Ath suggested, your XML  contains CDATA so you have to read that separately.

https://stackoverflow.com/questions/1274070/how-to-read-cdata-in-xml-file-with-powershell

I will try but can you help me with the below:

Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.

Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?


Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 06, 2018, 02:14 PM
If I could convert the XML file in a flat structured file
Converting your .xml to .csv is quite an easy one-liner in PowerShell, assuming a single xml file converted into a single .csv file:
Code: PowerShell
$([xml]$(Get-Content .\kalos-data1.xml)).SelectNodes("/CATALOG/PLANT")|Export-Csv .\kalos-data1.csv
The reason it wasn't working for you was probably that you didn't account for XML being case-sensitive.

Parsing an .xml file containing CDATA tags with a regex is close to impossible to get right, as nearly any content is possible inside such CDATA tag, including valid xml...
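One caveat for the 25GB case: casting Get-Content output to [xml] builds the whole document in memory. A streaming alternative is .NET's XmlReader, which walks the file one node at a time. The sketch below is an assumption about how that could look for the PLANT sample, not a tested replacement for the one-liner. The demo file is invented and only two of the fields are written out.

```powershell
# Hypothetical small file with the same shape as the PLANT sample.
$xmlPath = Join-Path ([System.IO.Path]::GetTempPath()) 'plants-demo.xml'
Set-Content -Path $xmlPath -Encoding UTF8 -Value @'
<CATALOG>
 <PLANT>
  <COMMON>Bloodroot</COMMON>
  <PRICE>$2.44</PRICE>
 </PLANT>
 <PLANT>
  <COMMON>Columbine</COMMON>
  <PRICE>$9.37</PRICE>
 </PLANT>
</CATALOG>
'@

$reader = [System.Xml.XmlReader]::Create($xmlPath)
try {
    $rows = while (-not $reader.EOF) {
        if ($reader.NodeType -eq 'Element' -and $reader.Name -eq 'PLANT') {
            # ReadOuterXml returns just this record and moves the reader past </PLANT>,
            # so only one PLANT element is parsed into an [xml] object at a time.
            $plant = ([xml]$reader.ReadOuterXml()).PLANT
            '{0}; {1}' -f $plant.COMMON, $plant.PRICE
        }
        else {
            [void]$reader.Read()
        }
    }
}
finally {
    $reader.Close()
}

$rows
Remove-Item $xmlPath
```

The loop only calls Read() when it is not sitting on a PLANT element, because ReadOuterXml has already advanced the reader. Calling Read() after it as well would risk skipping a record that follows immediately.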


Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.
Please listen to people with more (programming) experience than you have, you are really trying to hammer round screws into square holes here, don't do that, you'll hurt yourself.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 02:25 PM
I don't understand what CDATA is.

My xml file contains tons of tags, ie text inside <>, in a complex hierarchy.
Apart from that, it contains values both inside the <>, in the format of <someTag someID="SomeValue"> and in the format of <someTag>SomeValue</someTag>

1) I don't know what the total number and hierarchy of tags is. So can I select ALL nodes under the whole hierarchy?
2) Will PS process the both formats of values above?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 02:27 PM
Guys, the more I am looking on it, the more I am convinced that Regex would be the best solution.
Please listen to people with more (programming) experience than you have, you are really trying to hammer round screws into square holes here, don't do that, you'll hurt yourself.

OK but I would be highly interested to learn how to do the below?
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 06, 2018, 02:45 PM
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that? Last, how to find the next regex match in the file?
Well, the trouble is you'll have to do it in some scripting or programming language, as a regex is only a selection mechanism using pattern matching ('regular expressions').
For such a task I'd advise using sed, the Stream EDitor, originally from unix but also available for Windows, which is built for jobs like this.
I made a Sed-Tester tool for NANY a couple of years back; find it from the link below this post and try it out. It includes sed.exe and has a link to the sed documentation in the GUI.
You can also continue with the PS script 4wd gave here earlier, but that doesn't go through the data sequentially in the way sed does.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 06, 2018, 03:17 PM
Stream EDitor

Very interesting tool, thanks!
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 06, 2018, 04:41 PM
Can anyone tell me please how to find a regex in a file and append it to a file? Also, how to loop that?

That is what the code I originally posted does.

Last, how to find the next regex match in the file?

Add another Select-String line with the next RegEx.
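
As a rough sketch of that idea (the patterns and sample lines below are made up, not taken from the real data), the same Select-String pipeline can be run once per pattern in a loop. -AllMatches is what returns every match of the same pattern, not just the first one on a line:

```powershell
# Hypothetical patterns and sample text, purely for illustration.
$text   = 'id=17 name=alpha', 'id=23 name=beta'
$regex1 = 'id=(\d+)'
$regex2 = 'name=(\w+)'

# Run each pattern over the same original input, one after the other.
$results = foreach ($pattern in $regex1, $regex2) {
    $text |
        Select-String -Pattern $pattern -AllMatches |
        ForEach-Object { $_.Matches } |
        ForEach-Object { $_.Groups[1].Value }
}

$results    # all regex1 captures first, then all regex2 captures
# $results | Add-Content -Path 'out.txt'   # hypothetical output file
```

Each pattern works on the original input, so the results of the second run are simply appended after those of the first.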

I'm going to give up until we get at least sensible raw data and what the expected output should look like ...
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 07, 2018, 04:16 AM
Last, how to find the next regex match in the file?
Add another Select-String line with the next RegEx.

No, I don't mean a different regex. I mean the same regex. The same regex may have multiple matches in one file. How do I make the script find the first instance, do stuff, then find the second instance, do stuff, etc.?

Also, how do I make . to include newline?

Also, what is | % { $_.Matches } | % { $_.Value } >> $outfile exactly?
I don't know what % and { $_. and Value are?

Also, how do I return a specific part from regex? In normal regex text editors, you put the part in parentheses and then you replace them with \1 etc. How do I do it in Powershell?

Also, I should be able to figure this out myself, but I am looking for neat code and I can only manage to come up with messy stuff: is there a script to delete lines not containing specific literal phrases? E.g. not containing 'lue } >> $ou', without having to go through each character to check if it needs escaping or not.
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 07, 2018, 08:50 AM
You obviously don't understand what we wrote earlier. I already gave 2 (two) possible ways how to handle that. And there are other solutions too.

Like 4wd, I'll pick up again after you start giving complete examples of real pieces of a data file with exact descriptions of what you want to achieve, not of descriptions of how you think it should be handled/solved.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 07, 2018, 09:01 AM
I want answers to specific questions, not a complete solution. It is not possible for anyone external to offer a complete solution because the source data cannot be shared.
Also, most of my questions are for my own understanding and may not directly relate to the specific problem.
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 07, 2018, 09:29 AM
most of my questions are for my own understanding and may not directly relate to the specific problem.
You obviously don't understand what we wrote earlier. I already gave 2 (two) possible ways how to handle that. And there are other solutions too.
To spell it out, again:
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 07, 2018, 10:05 PM
It is not possible for anyone external to offer a complete solution because the source data cannot be shared.

And there you go, a perfect illustration of the main problem.

Despite asking for raw data several times starting from page 1, (and getting various snippets and interpretations), we find out on page 3 that the data can't be shared.

Any reason why you couldn't have told us this when it was first asked?

Also, what is | % { $_.Matches } | % { $_.Value } >> $outfile exactly?

|
A Pipe (https://quickleft.com/blog/command-line-tutorials-redirection-pipes/)

%
ForEach-Object (https://ss64.com/ps/foreach-object.html)

$_
Object passed through pipe.

Matches
A Property of the object.

Value
A Property of the object.
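
To make those pieces concrete, here is a small sketch of the same pipeline written out with full cmdlet names instead of aliases. The sample lines and the variable names are invented for illustration:

```powershell
# Hypothetical sample input; in the thread the input comes from files.
$lines = 'Function Get-Foo {', '  # comment', 'Function Set-Bar {'

$values = $lines |
    Select-String -Pattern '^Function.+' -AllMatches |   # emits one result object per matching line
    ForEach-Object { $_.Matches } |                       # % is an alias of ForEach-Object; $_ is the current object
    ForEach-Object { $_.Value }                           # Value is the matched text of each match

$values   # the two lines that start with "Function"
```

Appending `>> $outfile` to the last stage would write those values to a file instead of keeping them in $values.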

You do know that they have this thing called Google, right?

Let's say I want to extract the numbers in the fields ui_mode etc or each of these three separate records.

Code: PowerShell [Select]
  gci *.txt | % { sls $_.Name -Pattern '^.*"ui_mode",(\d+).*$' -a | % { $_.Matches } | % { $_.Groups[1].Value } >> K:\out.txt }

Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 08, 2018, 08:59 AM
gci *.txt | % { sls $_.Name -Pattern '^.*"ui_mode",(\d+).*$' -a | % { $_.Matches } | % { $_.Groups[1].Value } >> K:\out.txt }

I still struggle very much with this and Google does not help. The main source of confusion I believe is the fact that I don't know if a term in the command is a random variable name or if it is a specific variable which is part of PS core. Also, another thing is that I may be able to Google "what does % mean in powershell" to find out what a specific symbol means but if they are part of another word like $_.Matches, it becomes confusing.

So the gci command will grab all the files that match *.txt in the directory.
The pipe means that we run sequentially another command.
The sls command means that it matches the regex inside the input $_.Name. What is that? I read "The $_ variable holds a reference to the current item being processed." But I don't understand what that means. Any idea?
Then, we move on to the command % which is basically a for-each loop. So for each of the Matches found with the previous command, we run another for each loop, the Groups[1].Value. I don't know what that is either. Any idea?
Finally we append to out.txt.

Are the terms Name, Matches, Groups, Value, something standard in Powershell or they are random variables names?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 08, 2018, 01:39 PM
Name, Matches, Groups, Value, something standard in Powershell or they are random variables names?
4wd did a fine job of explaining all the special chars / shortcuts he used in the post just above yours.
Name is the standard attribute holding, wait for it..., the name of the object.
Matches is an attribute of the object that Select-String returns; Groups and Value are attributes of the Regex result type Match (.NET, the base of Powershell, as I explained earlier). Matches contains the match(es) found, and Groups contains all groups found in the regex provided, with Groups[0] holding the entire result and Groups[1] etc. reflecting the standard regex \1 result etc. Value holds the found value, the same text as Groups[0].Value.
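
A short sketch of how those attributes relate, using a made-up line that follows the ui_mode example (the sample text is an assumption, not real data):

```powershell
# Hypothetical sample line; the field name follows the earlier ui_mode example.
$line = '"ui_mode",42,"other",7'

$m = [regex]::Match($line, '"ui_mode",(\d+)')

$m.Value             # the whole match: "ui_mode",42
$m.Groups[0].Value   # the same text as $m.Value
$m.Groups[1].Value   # the first parenthesised group, i.e. the \1 of a text editor

# In a -replace operation the group is referenced as $1 (single-quoted so
# PowerShell does not treat it as a variable).
$onlyNumber = $line -replace '^.*"ui_mode",(\d+).*$', '$1'
$onlyNumber
```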
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 08, 2018, 10:05 PM
4wd did a fine job of explaining all the special chars / shortcuts ...

Hey now, that's going a bit far ...

@kalos:

Powershell Objects (https://www.computerworld.com/article/2954261/data-center/understanding-and-using-objects-in-powershell.html)
The Complete Guide to Powershell Punctuation (https://www.red-gate.com/simple-talk/wp-content/uploads/2015/09/PSPunctuationWallChart_1_0_4.pdf)
Powershell Commands (https://ss64.com/ps/)
Powershell Syntax (https://ss64.com/ps/syntax.html)

Every piece of script I've put in this thread resulted from one simple search (https://www.google.com/search?q=powershell%20match%20all%20strings%20in%20files) - especially the last piece (https://stackoverflow.com/questions/26891275/powershell-select-string-from-file-with-regex).

I search Google until I get close enough to the answer I want, it's how I learnt Powershell, (and I'm nowhere near being proficient in it).  If you're not getting what you want you're probably being too specific, change your search parameters, because it's a high probability that someone else has already asked it and most likely had it answered.

I suggest you read the link I posted earlier: How To Ask Questions The Smart Way (http://catb.org/~esr/faqs/smart-questions.html)

[Insert deity] knows I'm guilty of asking stupid questions but at least I try to provide all the information required/asked to help whoever it is that may decide to offer help, whether it be data, required output, the right question, cup of coffee, chocolate filled croissant, new car ... (wait, I draw the line there), etc.

And we still have no idea how your various bits of data file relate to each other.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 09, 2018, 10:53 AM
That's good info, so I will have to have a good read on these to be able to understand PS.

I understand better by studying examples, so I would appreciate your help with this.

PS: This is not related to the initial data file I wanted to process.

Any idea why this does not work?
Get-Content *.xml | Out-String

I want then to append to a file all the matches of a regex1 or regex2. Any idea?

Also, any idea on how to extract specific values from xml nodes?
I type  select-xml -path *.xml -xpath "/html:html/html:products/html:product/html:referenceData/html:pct/html:productId" and it doesn't work

Something else, I want to parse the text of a file and use -match on it, but I cannot figure out how to do it, it's so embarrassing!
gc *.xml -match *regex*
does not work :(
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 09, 2018, 07:59 PM
Any idea why this does not work?
Get-Content *.xml | Out-String

I have a better idea, you tell us why you think it doesn't work.

gc *.xml -match *regex*
does not work :(

Code: PowerShell [Select]
  Get-Help Get-Content

You tell us why it doesn't work.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 10, 2018, 05:32 AM
Any idea why this does not work?
Get-Content *.xml | Out-String

I have a better idea, you tell us why you think it doesn't work.

gc *.xml -match *regex*
does not work :(

Code: PowerShell [Select]
  Get-Help Get-Content

You tell us why it doesn't work.

Mmm I don't know why Get-Content *.xml | Out-String does not work to be honest.
I read at https://ss64.com/ps/out-string.html:
Send the content of Test1.txt to the console as a single string:
PS C:\> get-content C:\docs\test1.txt | out-string
So, shouldn't it work?

As for gc *.xml -match *regex* it seems that -match does not go after gc, but how do I connect/pipe these?
I had no clue, but I found in an irrelevant place on the internet this:
(Get-Content .\input.txt) -join "`r`n"
How should I know that I need to parenthesise the first object? Where does it say that in PS manual?
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 10, 2018, 07:55 AM
I think he was asking, what do you mean doesn't work?  You just said it didn't work and didn't give any indication of what happened.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 10, 2018, 08:38 AM
They were to get a couple of points across, which were completely missed.

1. His guess as to why the first doesn't work will always be better than whatever we could come up with because we never, ever get enough information/context to make an informed guess/opinion.

2. He doesn't read what has already been given because the answer is in this thread.

PS. Sorry mouser ...  :-\
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 10, 2018, 09:43 AM
2. He doesn't read what has already been given because the answer is in this thread.

PS. Sorry mouser ...  :-\

Oh sorry, but due to my learning difficulty I need to be pointed to the exact thing.
I am developing my PS understanding though and it seems very powerful  :Thmbsup:

I wrote this script:
(gc *.txt)  -replace "regex1(.+?)", "`$1" >> out

It replaces the regex with the reference from the regex and outputs to a new file. Not exactly what I want it to do.
I want to output the reference, any idea how to do that?
Also, I want to run sequential several regex matches with their own references, one by one and append each result to the output file.

I believe piping commands does not achieve this. I think piping is about getting the output object from the previous command and feed it to the next command. However, I want the various regex matches to work on the original object sequentially. This is a bit tricky, any idea?


Also, can you tell me please how to find and select and append values from multiple xml nodes knowing their XPath?
I do that and it doesn't work:
Select-Xml -Path "*.xml" -XPath "/html:book/html:Entity" >> out
Also this doesn't work:
PS H:\> [xml]$Types = get-content *.xml
PS H:\> select-xml -xml $Types -xpath "//html:Entity"
select-xml : Namespace Manager or XsltContext needed. This query

Thanks!
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 10, 2018, 01:37 PM
Also, I want to run sequential several regex matches with their own references, one by one and append each result to the output file.
You have to make clear whether the results from the separate queries have any positional relation to each other, or can the queries be run one after the other and the output of the second, third, etc., runs appended to the first regex run?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 10, 2018, 08:21 PM
Also, I want to run sequential several regex matches with their own references, one by one and append each result to the output file.
You have to make clear whether the results from the separate queries have any positional relation to each other, or can the queries be run one after the other and the output of the second, third, etc., runs appended to the first regex run?

Or to put it another way:

DO NOT give us separate examples of two disparate data types without showing how they relate to one another, (eg. XML <> and CDATA (https://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean) []), within the same file.

Until you can do that we're just running around in circles and it's pointless continuing this thread, as such I'm out of here until the above happens.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 13, 2018, 05:03 AM
OK, so the input is:
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>

The above continues to prod2 etc.

The output of the data would be:
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
Then a new line would start with:
prod2; etc


However, I want to convert the input data in a string, because, I may need to match longer substrings than eg "<html:classificationType>(.+?)</html:classificationType>"
Also, I think there may be duplicates for each prod, e.g. more than one assignedDate node with different values, so MatchAll would be best.
thanks!
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 13, 2018, 07:18 AM
So it's always xml and it's always that schema?  And you're just worried about duplicates?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 13, 2018, 08:45 AM
So it's always xml and it's always that schema?  And you're just worried about duplicates?


Yeah, for now it looks like that.
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 13, 2018, 10:05 AM
And one last question... when you say duplicate, you mean the whole record is duplicated?  Or just some of the fields, i.e. productID or prod id?
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 13, 2018, 10:47 AM
And one last question... when you say duplicate, you mean the whole record is duplicated?  Or just some of the fields, i.e. productID or prod id?

Some fields, eg there may be more than one assignedDate value, so the script will need to process these additional fields for the same prod.

The pseudocode I am looking for is like this:
1) search for the first 'prod' section of the file, convert it to single line, extract the appropriate regex (all matches) one after the other (that's why I want to specify the all the regex matches that I want the script to search for when scanning the line, as I am not sure which order they will be - it shouldn't change but just in case)
2) then find the next 'prod' section in the file, convert it to single line and put it in a line below the previous, then extract the regexes one by one

Any hint?

I tried to use ¦ to add OR regex matches, but I think it didn't work.
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 13, 2018, 02:05 PM
OK, so the input is:
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>

The above continues to prod2 etc.

The output of the data would be:
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
Then a new line would start with:
prod2; etc


That finally makes some sense. Here (https://www.donationcoder.com/forum/index.php?topic=45945.msg422205#msg422205) is an example solution for putting that into a .csv formatted file.
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)

  But, earlier in this thread you wrote this:
<CATALOG>
 <PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
 </PLANT>
Well guys, the data is what I posted in my last post (Plants),
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?


However, I want to convert the input data in a string, because, I may need to match longer substrings than eg "<html:classificationType>(.+?)</html:classificationType>"
You are talking b.s. here.


Also, I think there may be duplicates for each prod, e.g. more than one assignedDate node with different values, so MatchAll would be best.
This doesn't make sense without an example, and MatchAll is inappropriate here.


extract the appropriate regex
PLEASE STOP TELLING US HOW TO SOLVE YOUR CHALLENGE!
(This could have been bigger and in red, but I'm trying to stay nice, so I didn't)
If you want to learn regex, go get a book or on-line course, there are plenty here (http://lmgtfy.com/?q=regex+for+dummies) and here (http://lmgtfy.com/?q=books+about+regex), and stop feeding us xml.

When handling XML, no regexes are usually involved, unless the data elements contain 'complex', somewhat structured, data that needs to be broken down.
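
As an illustration of the XML route, here is a minimal sketch using Select-Xml with a namespace map. The sample is a cut-down, well-formed stand-in for the posted data, and the namespace URI is invented, because the real declaration was never shown in this thread:

```powershell
# Hypothetical, well-formed sample. The namespace URI is invented; the thread
# never showed the real declaration.
$xmlText = @'
<html:products xmlns:html="urn:example:products">
  <html:prod id="prod1">
    <html:product>
      <html:productId>10004</html:productId>
      <html:assignedDate>2018-07-23</html:assignedDate>
    </html:product>
  </html:prod>
</html:products>
'@

# Map the prefix used in the XPath to the namespace URI of the document.
$ns = @{ html = 'urn:example:products' }

$ids = Select-Xml -Content $xmlText -XPath '//html:productId' -Namespace $ns |
    ForEach-Object { $_.Node.InnerText }

$ids
```

Without the -Namespace argument, an XPath that uses a prefix such as html: is what triggers the "Namespace Manager or XsltContext needed" error quoted earlier.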

I have this assignment for you:
- read the entire thread from OP to the end and formulate an answer to all unanswered questions we asked you. (Just quote the question and type the answer below the quote)
After all the answers are given you can ask 1 new question. As 4wd already stated, and you said yourself but in other words, you aren't good at answering questions, but it is required for other people to help you solve your challenge/quest.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 13, 2018, 07:41 PM
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)

Or by installing an updated XML module (https://www.powershellgallery.com/packages/Xml/7.0) which will give you Remove-XmlNamespace

Still not convinced it's sufficient information since somewhere there should have been a Namespace declaration I would have thought, he's only giving the information for a record within the file.

Well guys, the data is what I posted in my last post (Plants),
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?

PS: This is not related to the initial data file I wanted to process.

So far:

... and counting ...

It's another problem, jumping from one thing to another without getting any one thing completed.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 20, 2018, 08:37 AM
I don't understand why you do not answer my specific questions, regardless of the source data format and the desired output. Is what I am asking not possible to be done with Powershell?

For example, I want to perform a regex match that will output all matches of regex1 and regex2 and regex3.

How can I do that?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 20, 2018, 08:59 AM
How can I do that?
That's why we asked more specific questions, but you never answered them.
So then I gave you the assignment of answering all our unanswered questions, but you haven't done that up until now, so basically, we are waiting (but not holding our breath) for your answers, before accepting new questions. :(
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 20, 2018, 09:36 AM
How can I do that?
That's why we asked more specific questions, but you never answered them.
So then I gave you the assignment of answering all our unanswered questions, but you haven't done that up until now, so basically, we are waiting (but not holding our breath) for your answers, before accepting new questions. :(


OK I start again:

That finally makes some sense. Here (https://www.donationcoder.com/forum/index.php?topic=45945.msg422205#msg422205) is an example solution for putting that into a .csv formatted file.

The problem with that is that I do not always know the node tree hierarchy and also it may change per record! That's why I cannot use the node tree hierarchy to extract a value, but I can use a guess of it, if that helps, eg //NODE1/*/NODE3/ ?

It doesn't even look a teensy bit like this new data you've given just now, are you playing us?

It looks the same to me??? Only the attribute names and values change. But again, the records do not contain the same attributes and in the same order. There can be some basic rules that all records follow, but unfortunately the data structure is not consistent, that's why I want to use regex, to include some fuzziness in matching!

Also, I still have not figured out how to make Powershell match . any character including newline... Any hint?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 20, 2018, 07:15 PM
Also, I still have not figured out how to make Powershell match . any character including newline... Any hint?

Learn to use Powershell's built-in help system:
Code: PowerShell [Select]
  Get-Help about_comparison_operators

Learn to use Google:
http://lmgtfy.com/?q=powershell+regex+match+any+character+including+newline
http://lmgtfy.com/?q=powershell+regex+match+multiline
http://bfy.tw/JV48
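
What those searches boil down to, as a small sketch with made-up sample text: the (?s) option goes inside the quotes, at the very start of the pattern, and it only helps if the text being matched is a single string that actually contains the newlines.

```powershell
# Hypothetical two-line sample; `n inside double quotes is a newline.
$text = "<a>first`nsecond</a>"

$withoutOption = $text -match '<a>(.+)</a>'       # by default . stops at a newline
$withOption    = $text -match '(?s)<a>(.+)</a>'   # (?s) lets . match newlines as well
$captured      = $Matches[1]                      # group 1 of the last successful match

$withoutOption
$withOption
$captured

# Get-Content returns an array of lines unless -Raw is given (PowerShell 3.0
# or later), so a pattern that spans lines needs the file read as one string:
# (Get-Content -Path 'test.xml' -Raw) -match '(?s)<a>.+?</a>'
```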
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 21, 2018, 04:02 AM
Also, I still have not figured out how to make Powershell match . any character including newline... Any hint?

Learn to use Powershell's built-in help system:
Code: PowerShell [Select]
  Get-Help about_comparison_operators

Learn to use Google:
http://lmgtfy.com/?q=powershell+regex+match+any+character+including+newline
http://lmgtfy.com/?q=powershell+regex+match+multiline
http://bfy.tw/JV48

 :up: but I cannot see in the list of operators the OR  :tellme:
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 21, 2018, 07:49 AM
but I cannot see in the list of operators the OR  :tellme:

Code: PowerShell [Select]
  Get-Help about_Logical_Operators
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 21, 2018, 08:31 AM
but I cannot see in the list of operators the OR  :tellme:

Code: PowerShell [Select]
  Get-Help about_Logical_Operators

I did that, but I get this:

PS H:\> Get-Help about_Logical_Operators
Get-Help : Get-Help could not find about_Logical_Operators in a help file in this session. To download updated help
topics type: "Update-Help". To get help online, search for the help topic in the TechNet library at
http://go.microsoft.com/fwlink/?LinkID=107116.
At line:1 char:1
+ Get-Help about_Logical_Operators
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ResourceUnavailable: (:) [Get-Help], HelpNotFoundException
    + FullyQualifiedErrorId : HelpNotFound,Microsoft.PowerShell.Commands.GetHelpCommand
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 21, 2018, 08:43 AM
Also, I still have not figured out how to make Powershell match . any character including newline... Any hint?

Learn to use Powershell's built-in help system:
Code: PowerShell [Select]
  Get-Help about_comparison_operators

Learn to use Google:
http://lmgtfy.com/?q=powershell+regex+match+any+character+including+newline
http://lmgtfy.com/?q=powershell+regex+match+multiline
http://bfy.tw/JV48

I tried that but it is not clear if (?s) goes to either:
at the beginning of the regex and after the '
at the beginning of the regex and before the '
just before .

Any idea?
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 21, 2018, 09:04 AM
but I cannot see in the list of operators the OR  :tellme:

Code: PowerShell [Select]
  Get-Help about_Logical_Operators

I did that, but I get this:

PS H:\> Get-Help about_Logical_Operators
Get-Help : Get-Help could not find about_Logical_Operators in a help file in this session. To download updated help
topics type: "Update-Help". To get help online, search for the help topic in the TechNet library at
http://go.microsoft.com/fwlink/?LinkID=107116.
At line:1 char:1
+ Get-Help about_Logical_Operators
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ResourceUnavailable: (:) [Get-Help], HelpNotFoundException
    + FullyQualifiedErrorId : HelpNotFound,Microsoft.PowerShell.Commands.GetHelpCommand

Code: PowerShell [Select]
  Get-Help about_*

if you don't see it there,

Code: PowerShell [Select]
  get-help update-help
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 21, 2018, 09:42 AM
Thanks, but it needs me to run it as admin, which I cannot.

Any idea why the below does not work?

(gc *.xml) -match '(?s)<\?xml\ version="1\.0"\ encoding="UTF-8"\?>.+?</dbts:PmryObj>'
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 21, 2018, 12:26 PM
Any idea why the below does not work?
As usual you are asking half questions without *any* documentation. And you still haven't answered all previous questions, as requested, (and even have asked new questions in the half-baked 'answer') so in my book, you're not yet ready to ask new questions.
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 21, 2018, 05:16 PM
Thanks, but it needs me to run it as admin, which I cannot.


TOPIC
    about_Logical_Operators

SHORT DESCRIPTION
    Describes the operators that connect statements in Windows PowerShell.


LONG DESCRIPTION
    The Windows PowerShell logical operators connect expressions and
    statements, allowing you to use a single expression to test for multiple
    conditions.


    For example, the following statement uses the and operator and
    the or operator to connect three conditional statements. The statement is
    true only when the value of $a is greater than the value of $b, and
    either $a or $b is less than 20.


        ($a -gt $b) -and (($a -lt 20) -or ($b -lt 20))


    Windows PowerShell supports the following logical operators.


        Operator  Description                      Example
        --------  ------------------------------   ------------------------
        -and      Logical and. TRUE only when      (1 -eq 1) -and (1 -eq 2)
                  both statements are TRUE.         False


        -or       Logical or. TRUE when either     (1 -eq 1) -or (1 -eq 2)
                  or both statements are TRUE.     True


        -xor      Logical exclusive or. TRUE       (1 -eq 1) -xor (2 -eq 2)
                  only when one of the statements  False
                  is TRUE and the other is FALSE.


        -not      Logical not. Negates the         -not (1 -eq 1)
                  statement that follows it.       False


        !         Logical not. Negates the         !(1 -eq 1)
                  statement that follows it.       False
                  (Same as -not)


    Note: The previous examples also use the equal to comparison
          operator (-eq). For more information, see about_Comparison_Operators.
          The examples also use the Boolean values of integers. The integer 0
          has a value of FALSE. All other integers have a value of TRUE.


    The syntax of the logical operators is as follows:


        <statement> {-AND | -OR | -XOR} <statement>
        {! | -NOT} <statement>


    Statements that use the logical operators return Boolean (TRUE or FALSE)
    values.


    The Windows PowerShell logical operators evaluate only the statements
    required to determine the truth value of the statement. If the left operand
    in a statement that contains the and operator is FALSE, the right operand
    is not evaluated. If the left operand in a statement that contains
    the or statement is TRUE, the right operand is not evaluated. As a result,
    you can use these statements in the same way that you would use
    the If statement.


SEE ALSO
    about_Operators
    Compare-Object
    about_Comparison_operators
    about_If
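
For the regex case asked about above, a tiny sketch (the sample value is invented) showing the two places an "or" can go: -or between two separate tests, or the | alternation character inside a single pattern.

```powershell
$line = 'name=alpha'   # hypothetical sample value

# -or joins two separate tests; the right side is skipped when the left is true.
$eitherTest = ($line -match '^id=') -or ($line -match '^name=')

# Inside a single regex, alternation is written with the | character.
$alternation = $line -match '^(id|name)='

$eitherTest
$alternation
```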

If you can't run an elevated command prompt... good luck in this endeavor.  You're going to need it.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on August 22, 2018, 03:55 AM
Any idea why the below does not work?
As usual you are asking half questions without *any* documentation. And you still haven't answered all previous questions, as requested, (and even have asked new questions in the half-baked 'answer') so in my book, you're not yet ready to ask new questions.

But I need ad-hoc answers, it's not about a specific thing I try to achieve, but mostly to learn
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 22, 2018, 04:44 AM
But I need ad-hoc answers, it's not about a specific thing I try to achieve, but mostly to learn

And yet every answer given here can be found on Google ... if you're going to learn anything, learn to ask the right questions.

Input
<?xml version="1.0" encoding="ISO8859-1" ?>
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod2">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD2</html:classificationType>
          <html:productType>PRD_XE2</html:productType>
          <html:productId>10005</html:productId>
          <html:assignedDate>2018-12-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS2</html:name>
          <html:Entity>REP_XE2</html:legalEntity>
          <html:location>ED2</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod3">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD3</html:classificationType>
          <html:productType>PRD_XE3</html:productType>
          <html:productId>10014</html:productId>
          <html:assignedDate>2013-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS3</html:name>
          <html:Entity>REP_XE3</html:legalEntity>
          <html:location>ED3</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod4">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XE4</html:productType>
          <html:productId>10567</html:productId>
          <html:assignedDate>2010-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS4</html:name>
          <html:Entity>REP_XE4</html:legalEntity>
          <html:location>ED4</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2015-05-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS5</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
</html:products>


Code: PowerShell
gc "test.xml" -Raw | sls '(?smi)(<html:prod\s.+?/html:prod>)' -AllMatches | % {$_.Matches} | % { ((((($_.Value) -replace '(<[^>]+>|\s)', '; ') -replace '`r', '') -replace '`n', '') -replace '(;\s)(;\s)+', '$1').Trim('; ') }

Output
PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
PRD2; PRD_XE2; 10005; 2018-12-23; REPAIRS2; REP_XE2; ED2
PRD3; PRD_XE3; 10014; 2013-07-23; REPAIRS3; REP_XE3; ED3
PRD4; PRD_XE4; 10567; 2010-07-23; REPAIRS4; REP_XE4; ED4
PRD5; PRD_XE5; 10004890; 2015-05-15; REPAIRS5; REP_XE5; ED5



Your homework to increase your knowledge: Render the above one line into a multi-line Powershell script with no command shortcuts.
Non-optional extra: Tell us what the RegEx is doing.
Optional extra: Fix it so you get the prod value from the input data at the start of the output lines.
Optional extra: Make it process multiple files without using gc *.xml anywhere in it.

If it doesn't work on your data, you tell us why, don't ask us, we're not mind readers.

I'm done.
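
For anyone working through the first two exercises, here is one possible way to unroll the one-liner into separate steps with the aliases expanded (gc = Get-Content, sls = Select-String, % = ForEach-Object). It is only a sketch: the here-string is a cut-down, made-up stand-in for test.xml, and the variable names are not from the original post. The two -replace calls using '`r' and '`n' are left out because the \s in the first -replace already consumes line breaks.

```powershell
# Hypothetical, cut-down stand-in for the contents of test.xml.
$xml = @'
<html:products>
    <html:prod id="prod1">
      <html:productType>PRD_XE</html:productType>
      <html:productId>10004</html:productId>
    </html:prod>
    <html:prod id="prod2">
      <html:productType>PRD_XE2</html:productType>
      <html:productId>10005</html:productId>
    </html:prod>
</html:products>
'@

# (?smi) = s: dot also matches newlines, m: multiline anchors, i: ignore case.
# '<html:prod\s' needs whitespace after "prod", so <html:products> and
# <html:productType> are skipped; '.+?' is lazy, so each match stops at the
# first closing /html:prod> instead of running on to the last one.
$blockPattern = '(?smi)(<html:prod\s.+?/html:prod>)'

$lines = $xml |
    Select-String -Pattern $blockPattern -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object {
        # Turn every tag and every whitespace character into '; ' ...
        $text = $_.Value -replace '(<[^>]+>|\s)', '; '
        # ... collapse runs of '; ' into a single '; ' ...
        $text = $text -replace '(;\s)(;\s)+', '$1'
        # ... and strip the leftover separators from both ends.
        $text.Trim([char[]]'; ')
    }

$lines
```

With a real file, the here-string would be replaced by Get-Content "test.xml" -Raw.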
Title: Re: Extract REGEX matches from multiple text files
Post by: tomos on August 22, 2018, 04:52 AM
.. I need ad-hoc answers, it's not about a specific thing I try to achieve, but mostly to learn
it's nice to see the enthusiasm for learning :up:

Regarding the responses you're getting here - I know nothing about the topic, but I can see you're being given a big opportunity to learn how to approach things, how to tackle a problem, how to learn.

Can you tell us:
why don't you take the experts' advice?
why don't you answer their questions?
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 22, 2018, 06:30 AM
Thanks, but it needs me to run it as admin, which I cannot.

And it's taken 4 pages to find that out - something that should have been stated earlier.

Any idea why the below does not work?

(gc *.xml) -match '(?s)<\?xml\ version="1\.0"\ encoding="UTF-8"\?>.+?</dbts:PmryObj>'

Sure.

Q: What's the input data?
A: We don't know.

Q: What's the command output?
A: We don't know.

Q: What version of Powershell are you using?
A: We don't know.

Q: What OS are you using, (including architecture)?
A: We don't know.

Q: What are the statistics of the input file (e.g. size)?
A: We don't know.

Q: Why the hell are you trying to process all files at once instead of one at a time?
A: We don't know.

etc, etc, etc, etc ... for 4 pages.

Idea: We don't know.
Why: See point 1 here (https://www.donationcoder.com/forum/index.php?topic=45945.msg422337#msg422337).
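
Without the input data nobody can say for sure, but one common reason a multi-line pattern fails with (gc *.xml) -match is that Get-Content without -Raw returns one string per line, and -match against an array filters the elements instead of matching across them. A minimal sketch of that behaviour, using made-up sample lines rather than the poster's files:

```powershell
# Stand-in for Get-Content WITHOUT -Raw: an array with one string per line.
$byLine = '<?xml version="1.0"?>', '<root>', '  <item>1</item>', '</root>'

# Stand-in for Get-Content -Raw: the whole file as one string.
$raw = $byLine -join "`n"

$pattern = '(?s)<root>.+?</root>'

# Against an array, -match returns the elements that match on their own.
# No single line contains both <root> and </root>, so nothing comes back.
$lineHits = @($byLine -match $pattern)

# Against a single string, -match returns $true/$false and fills $Matches.
$rawHit = $raw -match $pattern

"Lines matching: $($lineHits.Count)"
"Raw matched:    $rawHit"
```

Note that -Raw needs PowerShell 3.0 or later, which is one reason the version question matters.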
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on August 22, 2018, 07:05 AM
+1

But I need ad-hoc answers, it's not about a specific thing I try to achieve, but mostly to learn

In that case be as clear as you can be, by asking fully documented questions, meaning: (and I've said this before)

This entire thread is full of examples of you not following these business-standard rules...
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 22, 2018, 08:40 AM
Thanks, but it needs me to run it as admin, which I cannot.

And it's taken 4 pages to find that out - something that should have been stated earlier.

Any idea why the below does not work?

(gc *.xml) -match '(?s)<\?xml\ version="1\.0"\ encoding="UTF-8"\?>.+?</dbts:PmryObj>'

Sure.

Q: What's the input data?
A: We don't know.

Q: What's the command output?
A: We don't know.

Q: What version of Powershell are you using?
A: We don't know.

Q: What OS are you using, (including architecture)?
A: We don't know.

Q: What are the statistics of the input file (e.g. size)?
A: We don't know.

Q: Why the hell are you trying to process all files at once instead of one at a time?
A: We don't know.

etc, etc, etc, etc ... for 4 pages.

Idea: We don't know.
Why: See point 1 here (https://www.donationcoder.com/forum/index.php?topic=45945.msg422337#msg422337).

Did you look up how to get the answers on Google for the ones related to your environment?  And several of the things you say you don't know are things that you would know.  You know what your input data is.  You know what you would expect as an output.  You know this stuff or can get it.  Which leads people to believe that you're not trying to give us the information.  And so why waste time with incomplete information?  Learning difficulties is just an excuse for all of the questions that you posted.  None of those have to do with learning.

I literally highlighted what you typed for "What version of Powershell are you using" right clicked on it, searched in Bing, and came up with the answer.  There's no reason you couldn't do the same.

http://lmgtfy.com/?s=b&q=What+version+of+Powershell+are+you+using

https://www.bing.com/search?q=What+version+of+Powershell+are+you+using
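
For reference, the environment questions can usually be answered from the PowerShell prompt itself. A small sketch using built-in variables and .NET properties; the $info name and the commented-out file path are placeholders, not anything from this thread:

```powershell
# Collect the basics that were asked for: PowerShell version, OS, architecture.
$info = [ordered]@{
    PSVersion = $PSVersionTable.PSVersion.ToString()
    OS        = [System.Environment]::OSVersion.VersionString
    Is64BitOS = [System.Environment]::Is64BitOperatingSystem
}
$info

# Size of an input file in bytes (placeholder path, adjust before running):
# (Get-Item 'C:\path\to\input.xml').Length
```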
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 24, 2018, 06:37 AM
Code: PowerShell
Clear-Host
$products = (Get-Content "test.xml" -Raw) -xxxxx '(....)^.*?(..........................)'
for ($i = 1; $i -lt $products.Count; $i += 2) {
  $products[$i] -xxxxx '(.........)(...)(...)' | Foreach { Write-Host ($Matches[0] + (((($products[$i] -replace '(<[^>]+>|\s)', '; ' ) -replace '`r', '') -replace '`n', '') -replace '(;\s)(;\s)+', '$1').TrimEnd('; ')) }
}

Output
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
prod2; PRD2; PRD_XE2; 10005; 2018-12-23; REPAIRS2; REP_XE2; ED2
prod3; PRD3; PRD_XE3; 10014; 2013-07-23; REPAIRS3; REP_XE3; ED3
prod4; PRD4; PRD_XE4; 10567; 2010-07-23; REPAIRS4; REP_XE4; ED4
prod5; PRD5; PRD_XE5; 10004890; 2015-05-15; REPAIRS5; REP_XE5; ED5


-xxxxx = An operator

'(....)^.*?(..........................)' = A RegEx, number of dots represents number of characters in it.

'(.........)(...)(...)' = A RegEx, number of dots represents number of characters in it.


Mental exercise complete ...

Single Line Version
Code: PowerShell
gci *.xml | % {(gc $_ -Raw) -xxxxx '(....)^.*?(............................)' | % { if ($_ -xxxxx '(.........)(...)(...)') { ($matches[0] + (((($_ -replace '(<[^>]+>|\s)', '; ' ) -replace '`r', '') -replace '`n', '') -replace '(;\s)(;\s)+', '$1').TrimEnd('; ')) } }}

Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 24, 2018, 06:58 AM
Maybe we should just shut this down as it's getting a bit heated, and I think that everyone is done.
Agree. Moderator, please move this thread to underground.

Nah, just remove posts 94-97, 100-104 and then lock the thread - there is useful information in the various posts which won't be seen if moved to the Underground.
Title: Re: Extract REGEX matches from multiple text files
Post by: wraith808 on August 24, 2018, 03:14 PM
Cleaned up the thread as requested.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 24, 2018, 03:32 PM
Thanks Wraith.

Regarding the script above:
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on August 25, 2018, 11:28 PM
Code: PowerShell
Get-Help about_*

Thanks, didn't know there were so many  ;D
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 11, 2018, 11:08 AM
Guys, can anyone tell me the command that will find all the regex matches, isolate a specific part of each regex match and output all of them in a file?

I have the regex, but I don't know how to indicate a part in it.
The regex is this: "<html:productType>(.+?)</html:productType>"
I used the parentheses to isolate the part of the regex that I want to be output in the file.

What should the whole command look like?

I found online and wrote this:
[regex]::match($s,"<html:productType>(.+?)</html:productType>").Groups[1].Value
But I don't know where you specify the source text or if it is correct. Any hint?

Thanks!

PS: It is really a nightmare to do some simple stuff in Powershell. There is very poor and incomplete documentation. Do you think there could be any other solution? Python maybe or anything else? I need it to work with big data though, and if it has a GUI it would be nice. Also, it needs to be free for commercial and any other use.
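
On where the source text goes: in [regex]::Match($s, ...) the first argument, $s, is the text to search. A minimal sketch, assuming the text has been read into $s and that every match is wanted (so [regex]::Matches rather than Match). The sample string and the out.txt name are placeholders:

```powershell
# Placeholder input; with a real file this could be:
#   $s = Get-Content 'C:\path\to\input.xml' -Raw
$s = @'
<html:productType>PRD_XE</html:productType>
<html:productType>PRD_XE2</html:productType>
'@

$pattern = '<html:productType>(.+?)</html:productType>'

# Matches() returns every match; Groups[1] is the part inside the parentheses.
$values = [regex]::Matches($s, $pattern) | ForEach-Object { $_.Groups[1].Value }

$values
# To write the values to a file instead of the console:
# $values | Out-File -FilePath 'out.txt'
```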
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 12, 2018, 04:23 AM
Anyone please?
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on September 12, 2018, 05:07 AM
Any hint?
We've been here before: https://www.donationcoder.com/forum/index.php?topic=45945.msg422274#msg422274
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 12, 2018, 05:20 AM
Any hint?
We've been here before: https://www.donationcoder.com/forum/index.php?topic=45945.msg422274#msg422274

Ah great thanks!

I tested it and there is an issue. I searched the file and there is only one instance of <html:productType>(.+?)</html:productType>
However, the output file contains the captured value (.+?) twice. What could be the problem?

Thanks!

gci C:\XML.xml | % { sls $_.Name -Pattern '<html:productType>(.+?)<\/html:productType>' -a | % { $_.Matches } | % { $_.Groups[1].Value } >> C:\out.txt }
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on September 12, 2018, 10:10 AM
What could be the problem?
You haven't shared the file, so we'll never know, unless...
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 13, 2018, 05:10 AM
What could be the problem?
You haven't shared the file, so we'll never know, unless...

I made it work like that:
gci FILEPATH | sls -AllMatches '<html:productType>(.+?)<\/html:productType>' | % { $_.Matches } | % { $_.Groups[1].Value } >> FILEPATH\out.txt

But I don't know how I made it work lol, can you spot the error? Also, I know I asked before, but can you point me to somewhere that explains % { $_.Matches } | % { $_.Groups[1].Value } ?
I think % means 'for every' and $_.Matches is the object variable of the matches, while $_.Groups[1].Value is the content value of the matches objects, right? But what is [1]?

UPDATE: it seems both work, but which would be better?
Thanks!
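
A short illustration of the pieces asked about, using a made-up one-line input. % is an alias for ForEach-Object, $_ is the current pipeline item, and Groups is indexed by capture group: Groups[0] is the whole match and Groups[1] is whatever the first pair of parentheses captured.

```powershell
$line    = '<html:productType>PRD_XE</html:productType>'
$pattern = '<html:productType>(.+?)</html:productType>'

$m = [regex]::Match($line, $pattern)

$whole = $m.Groups[0].Value   # the entire matched text, tags included
$inner = $m.Groups[1].Value   # only what the first (...) captured

"Groups[0]: $whole"
"Groups[1]: $inner"

# '%' and 'ForEach-Object' are the same command; $_ is the current item.
$doubled = 1, 2, 3 | ForEach-Object { $_ * 2 }
"Doubled:   $($doubled -join ', ')"
```

Also worth knowing when checking for duplicates: >> appends to an existing file, so running a command twice against the same out.txt leaves two copies of every value, whereas > overwrites.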
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 17, 2018, 05:03 AM
Guys, after I search for regex matches in a text, how can I write the matches to separate files, grouped by the captured value inside each match?

For example, for every regex match <html:producttype>(.+?)</html:producttype>, I want to output to a separate file all the matches where the (.+?) is the same.

Any idea? Also, please explain the strategy/pseudocode to see how that would work.
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on September 17, 2018, 12:59 PM
I want to output to a separate file all the matches where the (.+?) is the same.
Run a second command on your previous output
Code: PowerShell
gc FILEPATH\out.txt|group|select Count,Name >FILEPATH\out-counted.txt
The code is also the pseudo code.
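
Spelled out without aliases (gc = Get-Content, group = Group-Object, select = Select-Object), and with an in-memory list standing in for the lines of out.txt, the idea looks roughly like this. The sample values are invented for illustration:

```powershell
# Stand-in for: Get-Content FILEPATH\out.txt
$values = 'Product1', 'Product1', 'Product2', 'Product1', 'Product3'

# Group identical lines together, then keep only the count and the value.
$counts = $values | Group-Object | Select-Object Count, Name

$counts
```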
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 18, 2018, 07:56 AM
gc FILEPATH\out.txt|group|select Count,Name >FILEPATH\out-counted.txt

No you misunderstood. I don't want to count matches. I want to group them and output them in a separate file.

For example, I will search for my regex:
<html:producttype>(.+?)</html:producttype>
The possible matches will be:
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product2</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product3</html:producttype>
etc

I want the script to create one file per distinct (.+?) value, containing all the matches that share that value, so:
1 file that contains:
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
1 file that contains:
<html:producttype>Product2</html:producttype>
and 1 file that contains:
<html:producttype>Product3</html:producttype>

Thanks!
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on September 18, 2018, 09:06 AM
What's the difference?

You either have 3 lines that say:

3  Product1
1  Product2
1  Product3

Or three files that contain lines that say:

File "Product1.txt"
Product1
Product1
Product1

File "Product2.txt"
Product2

File "Product3.txt"
Product3

Either way all you're getting is a count of how many times a match appears.
Title: Re: Extract REGEX matches from multiple text files
Post by: kalos on September 18, 2018, 09:21 AM
No it's not the same, because the regex will be different! And I want to store the whole regex match in the file, which will be huge multiline text!
Title: Re: Extract REGEX matches from multiple text files
Post by: Ath on September 18, 2018, 12:43 PM
because the regex will be different! And I want to store the whole regex match in the file, which will be huge multiline text!
Whut? :o

We're back at square one. The circle is completed, again.

You have come here, asking for 'help'. Please provide us with what you want to achieve, and stop asking for small bits of silly info, with even more silly examples. This is not going to get you to a solution, as you obviously don't understand how problem-solving works.
I'm pretty sure I've said this before, a couple of weeks ago. :(
If you can't comply with that, I'd suggest all participants ignore your requests until something useful comes out of your keyboard.
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on September 18, 2018, 05:26 PM
You expected anything more?

@kalos: Paste your complete PS script here as it is currently, not as a single line but as a correctly formatted script with no PS shortcuts.

FYI: 80% of what you seemingly want now is covered by this script: https://www.donationcoder.com/forum/index.php?topic=45945.msg422784#msg422784

Only thing missing is output to separate files which would be trivial to add ...
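
As a rough idea of what "output to separate files" could look like, here is a standalone sketch that is not taken from the linked script: it groups whole matches by their captured value and writes each group to its own file. The sample text, the temp folder name and the variable names are all made up, and it assumes the captured values are valid file names.

```powershell
# Hypothetical sample text; a real run would read this from a file with -Raw.
$text = @'
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product2</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product3</html:producttype>
'@

$pattern = '<html:producttype>(.+?)</html:producttype>'

# Demo output folder under the temp directory (placeholder name).
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'regex-groups-demo'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

[regex]::Matches($text, $pattern) |
    Group-Object { $_.Groups[1].Value } |        # one group per captured value
    ForEach-Object {
        $file = Join-Path $outDir ($_.Name + '.txt')
        # $_.Group holds the Match objects; write each whole match as a line.
        $_.Group | ForEach-Object { $_.Value } | Set-Content -Path $file
    }

Get-ChildItem -Path $outDir -Filter '*.txt' | Select-Object Name, Length
```

Set-Content overwrites each file on every run; Add-Content would be the choice if results from several input files need to accumulate.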
Title: Re: Extract REGEX matches from multiple text files
Post by: 4wd on September 23, 2018, 03:39 AM
Input
File 1: xml-test.xml
<?xml version="1.0" encoding="ISO8859-1" ?>
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod2">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD2</html:classificationType>
          <html:productType>PRD_XE2</html:productType>
          <html:productId>10005</html:productId>
          <html:assignedDate>2018-12-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS2</html:name>
          <html:Entity>REP_XE2</html:legalEntity>
          <html:location>ED2</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod3">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2013-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS3</html:name>
          <html:Entity>REP_XE3</html:legalEntity>
          <html:location>ED3</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XE4</html:productType>
          <html:productId>10567</html:productId>
          <html:assignedDate>2010-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS4</html:name>
          <html:Entity>REP_XE4</html:legalEntity>
          <html:location>ED4</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2015-05-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS5</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
</html:products>

File 2: xml test2.xml
<?xml version="1.0" encoding="ISO8859-1" ?>
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-03-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REFUNDS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod2">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD2</html:classificationType>
          <html:productType>PRD_XE2</html:productType>
          <html:productId>10005</html:productId>
          <html:assignedDate>2015-12-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS2k12</html:name>
          <html:Entity>REP_XE2</html:legalEntity>
          <html:location>ED57</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod3">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XER3</html:productType>
          <html:productId>10014</html:productId>
          <html:assignedDate>2010-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>DESTRUCTION</html:name>
          <html:Entity>REP_XE3</html:legalEntity>
          <html:location>ED43</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod4">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XE4</html:productType>
          <html:productId>10567</html:productId>
          <html:assignedDate>1999-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>WHORU</html:name>
          <html:Entity>REP_XS4</html:legalEntity>
          <html:location>ED4</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2115-12-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>SCREW_THIS</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
</html:products>


Output
10004890.csv
Code: Text
prod5,PRD5,PRD_XE5,10004890,2115-12-15,SCREW_THIS,REP_XE5,ED5
prod5,PRD5,PRD_XE5,10004890,2015-05-15,REPAIRS5,REP_XE5,ED5

10004890.xml
Code: Text
<html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2115-12-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>SCREW_THIS</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
<html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2015-05-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS5</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>


Code: PowerShell
<#
.NAME
    XML-GUI.ps1
#>

Add-Type -AssemblyName System.Windows.Forms
[System.Windows.Forms.Application]::EnableVisualStyles()

#region begin GUI{

$Form                            = New-Object system.Windows.Forms.Form
$Form.ClientSize                 = '246,178'
$Form.text                       = "XML Mulcher"
$Form.BackColor                  = "#cccccc"
$Form.TopMost                    = $false
$Form.FormBorderStyle            = 'Fixed3D'
$Form.MaximizeBox                = $false

$TextBox1                        = New-Object system.Windows.Forms.TextBox
$TextBox1.Text                   = ""
$TextBox1.multiline              = $false
$TextBox1.ReadOnly               = $true
$TextBox1.Width                  = 185
$TextBox1.height                 = 20
$TextBox1.Location               = New-Object System.Drawing.Point(16,20)
$TextBox1.Font                   = 'Microsoft Sans Serif,10'

$ListBox1                        = New-Object system.Windows.Forms.ListBox
$ListBox1.text                   = ""
$ListBox1.width                  = 100
$ListBox1.height                 = 56
@('Classification','ProductType','ProductID') | ForEach-Object {[void] $ListBox1.Items.Add($_)}
$ListBox1.location               = New-Object System.Drawing.Point(16,50)

$Label1                          = New-Object system.Windows.Forms.Label
$Label1.Text                     = "Processing:"
$Label1.width                    = 68
$Label1.height                   = 16
$Label1.location                 = New-Object System.Drawing.Point(16,146)
$Label1.Font                     = 'Microsoft Sans Serif,8'

$TextBox2                        = New-Object system.Windows.Forms.TextBox
$TextBox2.multiline              = $false
$TextBox2.ReadOnly               = $true
$TextBox2.Width                  = 140
$TextBox2.height                 = 16
$TextBox2.Location               = New-Object System.Drawing.Point(88,144)
$TextBox2.Font                   = 'Microsoft Sans Serif,8'

$Button1                         = New-Object system.Windows.Forms.Button
$Button1.text                    = "Go"
$Button1.width                   = 60
$Button1.height                  = 30
$Button1.location                = New-Object System.Drawing.Point(171,65)
$Button1.Font                    = 'Microsoft Sans Serif,10'

$Button2                         = New-Object system.Windows.Forms.Button
$Button2.text                    = "..."
$Button2.width                   = 25
$Button2.height                  = 25
$Button2.location                = New-Object System.Drawing.Point(206,19)
$Button2.Font                    = 'Microsoft Sans Serif,10'

$Label2                          = New-Object system.Windows.Forms.Label
$Label2.Text                     = "Output:"
$Label2.width                    = 60
$Label2.height                   = 16
$Label2.location                 = New-Object System.Drawing.Point(16,120)
$Label2.Font                     = 'Microsoft Sans Serif,8'

$RadioButton1                    = New-Object system.Windows.Forms.RadioButton
$RadioButton1.text               = "XML"
$RadioButton1.AutoSize           = $true
$RadioButton1.width              = 40
$RadioButton1.height             = 16
$RadioButton1.location           = New-Object System.Drawing.Point(88,118)
$RadioButton1.Font               = 'Microsoft Sans Serif,8'

$RadioButton2                    = New-Object system.Windows.Forms.RadioButton
$RadioButton2.text               = "CSV"
$RadioButton2.Checked            = $true
$RadioButton2.AutoSize           = $true
$RadioButton2.width              = 40
$RadioButton2.height             = 16
$RadioButton2.location           = New-Object System.Drawing.Point(148,118)
$RadioButton2.Font               = 'Microsoft Sans Serif,8'

$Form.controls.AddRange(@($ListBox1,$TextBox1,$Button1,$Button2,$Label1,$TextBox2,$Label2,$RadioButton1,$RadioButton2))

#region gui events {
$Button1.Add_Click({
  if ($TextBox1.Text -ne "") {
    if ($ListBox1.SelectedItem -ne $null) {
      Clear-Host
      Set-Regex ($ListBox1.SelectedItem)
    }
  }
})

$Button2.Add_Click({
  $objForm = New-Object System.Windows.Forms.FolderBrowserDialog
  $objForm.Description = "Select folder containing XML"
  $objForm.SelectedPath = [System.Environment+SpecialFolder]'MyComputer'
  $objForm.ShowNewFolderButton = $false
  $result = $objForm.ShowDialog()
  if ($result -eq "OK") {
    $TextBox1.Text = $objForm.SelectedPath
  } else {
    $TextBox1.Text = ""
  }
})

#endregion events }
#endregion GUI }


#Write your logic code here
Function Set-Regex {
  param (
    [string]$selItem
  )
  switch ($selItem) {
    "Classification" { $regex = '(____________________________)(.+?)(___)' }
    "ProductType" { $regex = '(_____________________)(.+?)(___)' }
    "ProductID" { $regex = '(___________________)(.+?)(___)' }
  }
  Mulch-Files $regex
}

Function Mulch-Files {
  param (
    [string]$pattern
  )
  $files = Get-ChildItem -Path ($TextBox1.Text + "\*.xml")
  for ($h = 0; $h -lt $files.Count; $h++) {
    $TextBox2.Text = $files[$h].Name
    $products = (Get-Content $files[$h] -Raw) -_____ '(____)^.*?(____________________________)'
    for ($i = 1; $i -lt $products.Count; $i += 2) {
      $products[$i] -_____ '(_________)(.+?)(___)'
      $prod = $Matches[0]
      $temp = $products[$i] -_____ $pattern
      for ($j = 0; $j -lt $temp.Count; $j++) {
        if ($RadioButton2.Checked) {
          $outFile = $Matches[0] + ".csv"
          $outText = ($prod + (((($products[$i] -replace '(<[^>]+>|\s)', ',' ) -replace '`r', '') -replace '`n', '') -replace '(,)(,)+', '$1').TrimEnd(','))
        } else {
          $outFile = $Matches[0] + ".xml"
          $outText = $products[$i]
        }
        Out-File -FilePath $outFile -InputObject $outText -Append
      }
    }
  }
  $TextBox2.Text = "Finished"
}

[void]$Form.ShowDialog()