Extract REGEX matches from multiple text files

kalos:
And one last question... when you say duplicate, you mean the whole record is duplicated? Or just some of the fields, i.e. productID or prod id?
-wraith808 (August 13, 2018, 10:05 AM)
--- End quote ---

Some fields, e.g. there may be more than one assignedDate value, so the script will need to process these additional fields for the same prod.

The pseudocode I am looking for is like this:
1) search for the first 'prod' section of the file, convert it to a single line, and extract all matches of the appropriate regexes one after the other (that's why I want to specify all the regexes the script should search for when scanning the line, as I am not sure which order they will be in; it shouldn't change, but just in case)
2) then find the next 'prod' section in the file, convert it to a single line, put it on a line below the previous one, then extract the regexes one by one

Any hint?

I tried to use | to add OR regex matches, but I think it didn't work.
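
For readers following along, the two pseudocode steps above can be sketched as follows. This is only a minimal illustration in Python (not PowerShell, which the thread is about); the sample data is a shortened, partly hypothetical version of the posted record, and the names PROD_SECTION, PATTERNS and extract_rows are made up for this sketch.

```python
import re

# Shortened sample; the second prod and its two dates are hypothetical.
SAMPLE = """<html:products>
<html:prod id="prod1">
<html:referenceData>
<html:product>
<html:classificationType>PRD</html:classificationType>
<html:assignedDate>2018-07-23</html:assignedDate>
</html:product>
</html:referenceData>
</html:prod>
<html:prod id="prod2">
<html:referenceData>
<html:product>
<html:classificationType>PRD</html:classificationType>
<html:assignedDate>2018-07-24</html:assignedDate>
<html:assignedDate>2018-07-25</html:assignedDate>
</html:product>
</html:referenceData>
</html:prod>
</html:products>"""

# One 'prod' section, from its opening tag to its closing tag.
PROD_SECTION = re.compile(r'<html:prod id="[^"]*">.*?</html:prod>', re.DOTALL)

# The regexes to extract, in the order the output columns should appear.
PATTERNS = [
    r'<html:prod id="([^"]*)">',
    r'<html:classificationType>(.+?)</html:classificationType>',
    r'<html:assignedDate>(.+?)</html:assignedDate>',
]

def extract_rows(text):
    """Return one '; '-separated output line per prod section."""
    rows = []
    for section in PROD_SECTION.findall(text):
        # Step 1: collapse the section to a single line.
        one_line = " ".join(section.split())
        # Step 2: collect every match of every regex, pattern by pattern.
        fields = []
        for pattern in PATTERNS:
            fields.extend(re.findall(pattern, one_line))
        rows.append("; ".join(fields))
    return rows

for row in extract_rows(SAMPLE):
    print(row)
```

Because each pattern is applied separately, the columns follow the order of PATTERNS rather than the order of the tags in the file, and repeated fields such as the second assignedDate simply add extra columns to that row.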

Ath:
OK, so the input is:
<html:products>
<html:prod id="prod1">
<html:referenceData>
<html:product>
<html:classificationType>PRD</html:classificationType>
<html:productType>PRD_XE</html:productType>
<html:productId>10004</html:productId>
<html:assignedDate>2018-07-23</html:assignedDate>
</html:product>
<html:book>
<html:name>REPAIRS</html:name>
<html:legalEntity>REP_XE</html:legalEntity>
<html:location>ED</html:location>
</html:book>
</html:referenceData>
</html:prod>

The above continues to prod2 etc.

The output of the data would be:
prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
Then a new line would start with:
prod2; etc

-kalos (August 13, 2018, 05:03 AM)
--- End quote ---
That finally makes some sense. Here is an example solution for putting that into a .csv formatted file.
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)
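
To illustrate what filtering out the namespace prefix and writing one line per prod could look like, here is a hedged sketch in Python using the standard xml.etree.ElementTree parser. It is not Ath's solution, only an illustration. It assumes the element in the book section is named legalEntity (matching its closing tag), and the names FIELDS, strip_prefix and prods_to_lines are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<html:products>
<html:prod id="prod1">
<html:referenceData>
<html:product>
<html:classificationType>PRD</html:classificationType>
<html:productType>PRD_XE</html:productType>
<html:productId>10004</html:productId>
<html:assignedDate>2018-07-23</html:assignedDate>
</html:product>
<html:book>
<html:name>REPAIRS</html:name>
<html:legalEntity>REP_XE</html:legalEntity>
<html:location>ED</html:location>
</html:book>
</html:referenceData>
</html:prod>
</html:products>"""

# Output columns after the prod id, in the order of the requested output.
FIELDS = ["classificationType", "productType", "productId",
          "assignedDate", "name", "legalEntity", "location"]

def strip_prefix(text, prefix="html"):
    """Remove 'html:' from opening and closing tags so the text parses
    without a namespace declaration."""
    return re.sub(r"(</?)" + re.escape(prefix) + ":", r"\1", text)

def prods_to_lines(text):
    """Return one '; '-separated line per prod element."""
    root = ET.fromstring(strip_prefix(text))
    lines = []
    for prod in root.iter("prod"):
        values = [prod.get("id", "")]
        for field in FIELDS:
            node = prod.find(".//" + field)
            # A missing element becomes an empty column.
            values.append(node.text.strip() if node is not None and node.text else "")
        lines.append("; ".join(values))
    return lines

print("\n".join(prods_to_lines(SAMPLE)))
# prod1; PRD; PRD_XE; 10004; 2018-07-23; REPAIRS; REP_XE; ED
```

Stripping the prefix as plain text is only reasonable here because html: is the one prefix in use; with several namespaces a proper declaration would be the safer route.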

But, earlier in this thread you wrote this:
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
-kalos (August 06, 2018, 03:26 AM)
--- End quote ---
Well guys, the data is what I posted in my last post (Plants),
-kalos (August 06, 2018, 07:01 AM)
--- End quote ---

--- End quote ---
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?

However, I want to convert the input data into a string, because I may need to match longer substrings than e.g. "<html:classificationType>(.+?)</html:classificationType>"
-kalos (August 13, 2018, 05:03 AM)
--- End quote ---
You are talking b.s. here.

Also, I think there may be duplicates for each prod, e.g. more than one assignedDate node with different values, so MatchAll would be best.
-kalos (August 13, 2018, 05:03 AM)
--- End quote ---
This doesn't make sense without an example, and MatchAll is inappropriate here.
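
Since no example of such duplicates was posted, the following Python sketch uses a purely hypothetical record with two assignedDate nodes to show one way repeated elements can be collected with an XML parser. The html: prefix is left out for brevity, and the choice to join repeated values with | inside one column is an assumption, not something requested in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the second assignedDate is invented for illustration.
SAMPLE = """<prod id="prod1">
<product>
<productId>10004</productId>
<assignedDate>2018-07-23</assignedDate>
<assignedDate>2018-08-01</assignedDate>
</product>
</prod>"""

def all_values(prod, tag):
    """Return the text of every descendant element named `tag`, in document order."""
    return [node.text for node in prod.iter(tag)]

prod = ET.fromstring(SAMPLE)
dates = all_values(prod, "assignedDate")

# Keep one column per field; repeated values share the column, separated by '|'.
line = "; ".join([prod.get("id"), prod.findtext(".//productId"), "|".join(dates)])
print(line)
# prod1; 10004; 2018-07-23|2018-08-01
```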

extract the appropriate regex
-kalos (August 13, 2018, 10:47 AM)
--- End quote ---
PLEASE STOP TELLING US HOW TO SOLVE YOUR CHALLENGE!
(This could have been bigger and in red, but I'm trying to stay nice, so I didn't)
If you want to learn regex, go get a book or take an online course (there are plenty here and here), and stop feeding us XML.

When handling XML, no regexes are usually involved, unless the data elements contain 'complex', somewhat structured, data that needs to be broken down.
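
As a small illustration of that division of labour, the Python sketch below lets the XML parser locate the element and only then applies a regex to the element's content. Treating the assignedDate value as the 'somewhat structured' data is an assumption made for the example, and DATE and split_date are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# The parser finds the element; no regex is needed for the tags themselves.
record = ET.fromstring(
    "<product><assignedDate>2018-07-23</assignedDate></product>"
)

# A regex is only used to break down the structured text inside the element.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def split_date(text):
    """Return the date parts as a dict, or None if the text is not YYYY-MM-DD."""
    match = DATE.fullmatch(text.strip())
    return match.groupdict() if match else None

parts = split_date(record.findtext("assignedDate"))
print(parts)
# {'year': '2018', 'month': '07', 'day': '23'}
```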

I have this assignment for you:
- read the entire thread from OP to the end and formulate an answer to all unanswered questions we asked you. (Just quote the question and type the answer below the quote)
After all the answers are given, you can ask 1 new question. As 4wd already stated, and as you said yourself in other words, you aren't good at answering questions, but answering them is required for other people to help you solve your challenge/quest.

4wd:
You didn't give the specification for that html: namespace though. (But as it's the only namespace used, for data-extraction it can be filtered out)
-Ath (August 13, 2018, 02:05 PM)
--- End quote ---

Or by installing an updated XML module which will give you Remove-XmlNamespace

I'm still not convinced that's sufficient information: I would have thought there should be a namespace declaration somewhere, and he's only giving the information for one record within the file.
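
Why the missing declaration matters can be shown with a short Python sketch. A fragment that uses the html: prefix without declaring it is rejected by a namespace-aware parser, while the same fragment parses once a declaration is present. The URI below is entirely hypothetical, since the real one was never posted, and parses is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# A record as posted: the html: prefix is used but never declared.
FRAGMENT = '<html:prod id="prod1"><html:productId>10004</html:productId></html:prod>'

def parses(text):
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical URI; the real namespace declaration is unknown.
HYPOTHETICAL_URI = "http://example.com/products"
wrapped = '<html:products xmlns:html="%s">%s</html:products>' % (
    HYPOTHETICAL_URI, FRAGMENT)

print(parses(FRAGMENT))  # False: the prefix is unbound
print(parses(wrapped))   # True: the prefix is now declared

# With the declaration known, elements can be addressed by prefix.
root = ET.fromstring(wrapped)
ns = {"html": HYPOTHETICAL_URI}
ids = [prod.get("id") for prod in root.findall("html:prod", ns)]
product_ids = [e.text for e in root.findall(".//html:productId", ns)]
print(ids, product_ids)
```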

Well guys, the data is what I posted in my last post (Plants),
-kalos (August 06, 2018, 07:01 AM)
--- End quote ---

--- End quote ---
It doesn't even look a teensy bit like this new data you've given just now, are you playing us?
--- End quote ---

PS: This is not related to the initial data file I wanted to process.
-kalos (August 09, 2018, 10:53 AM)
--- End quote ---

So far:

* XML,
* XML with Namespace, (file sizes unknown),
* CDATA
... and counting ...

It's another problem, jumping from one thing to another without getting any one thing completed.

kalos:
I don't understand why you do not answer my specific questions, regardless of the source data format and the desired output. Is what I am asking not possible with PowerShell?

For example, I want to perform a regex match that will output all matches of regex1 and regex2 and regex3.

How can I do that?
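
Taken purely as a regex question, one common way to get all matches of several patterns in a single pass is to join them with the ASCII | alternation operator and give each alternative a named group. The sketch below is in Python, not PowerShell, and the three patterns merely stand in for regex1, regex2 and regex3; COMBINED and find_all are names made up for this example.

```python
import re

# Three stand-in patterns joined with '|'; each has its own named group.
COMBINED = re.compile(
    r"<html:classificationType>(?P<classificationType>.+?)</html:classificationType>"
    r"|<html:productId>(?P<productId>.+?)</html:productId>"
    r"|<html:assignedDate>(?P<assignedDate>.+?)</html:assignedDate>"
)

def find_all(text):
    """Return (pattern name, captured value) for every match, in text order."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in COMBINED.finditer(text)]

# Hypothetical single-line input; the second assignedDate is invented.
line = ("<html:classificationType>PRD</html:classificationType> "
        "<html:productId>10004</html:productId> "
        "<html:assignedDate>2018-07-23</html:assignedDate> "
        "<html:assignedDate>2018-08-01</html:assignedDate>")

for name, value in find_all(line):
    print(name, value)
```

Each result carries the name of the alternative that matched, so the order of the fields in the input does not need to be known in advance.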

Ath:
How can I do that?
-kalos (August 20, 2018, 08:37 AM)
--- End quote ---
That's why we asked more specific questions, but you never answered them.
So then I gave you the assignment of answering all our unanswered questions, but you haven't done that up until now, so basically, we are waiting (but not holding our breath) for your answers, before accepting new questions. :(
