Topic: Extract REGEX matches from multiple text files  (Read 4968 times)

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,250
Re: Extract REGEX matches from multiple text files
« Reply #100 on: September 12, 2018, 10:10 AM »
Quote: "What could be the problem?"
You haven't shared the file, so we'll never know, unless...

kalos

  • Member
  • Joined in 2006
  • Posts: 1,672
Re: Extract REGEX matches from multiple text files
« Reply #101 on: September 13, 2018, 05:10 AM »
Quote from Ath:
You haven't shared the file, so we'll never know, unless...

I made it work like that:
gci FILEPATH | sls -AllMatches '<html:productType>(.+?)<\/html:productType>' | % { $_.Matches } | % { $_.Groups[1].Value } >> FILEPATH\out.txt

But I don't know how I made it work, lol. Can you spot the error? Also, I know I asked before, but can you point me to something that explains % { $_.Matches } | % { $_.Groups[1].Value }?
I think % means 'for each' and $_.Matches holds the match objects, while $_.Groups[1].Value is the text content of each match, right? But what is [1]?

UPDATE: it seems both work, but which would be better?
Thanks!
« Last Edit: September 13, 2018, 07:46 AM by kalos »
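
For readers wondering about the [1] in the one-liner above, here is a minimal sketch with the aliases spelled out. The sample line is made up for illustration; only the pattern comes from the post. In .NET regex, Groups[0] is the whole match and Groups[1] is the text captured by the first pair of parentheses.

```powershell
# Hypothetical sample input standing in for lines read from the files.
$lines = @(
  '<html:productType>PRD_XE</html:productType><html:productType>PRD_XE2</html:productType>'
)
$pattern = '<html:productType>(.+?)</html:productType>'

# % is an alias for ForEach-Object, sls for Select-String.
$matchObjects = $lines |
  Select-String -Pattern $pattern -AllMatches |
  ForEach-Object { $_.Matches }          # one Match object per hit

# Groups[0] = entire matched text, Groups[1] = first capture group (.+?)
$whole  = $matchObjects | ForEach-Object { $_.Groups[0].Value }
$values = $matchObjects | ForEach-Object { $_.Groups[1].Value }

$values   # the captured text only, without the surrounding tags
```

So the first ForEach-Object unpacks the Matches collection of each matching line, and the second picks the captured part out of every match.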

kalos

  • Member
  • Joined in 2006
  • Posts: 1,672
Re: Extract REGEX matches from multiple text files
« Reply #102 on: September 17, 2018, 05:03 AM »
Guys, after I search for regex matches in a text, how can I group the matches into separate files, by the value captured inside the regex match?

For example, for every regex match <html:producttype>(.+?)</html:producttype>, I want to output to a separate file all the matches where the (.+?) is the same.

Any idea? Also, please explain the strategy/pseudocode to see how that would work.

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,250
Re: Extract REGEX matches from multiple text files
« Reply #103 on: September 17, 2018, 12:59 PM »
Quote from kalos: I want to output to a separate file all the matches where the (.+?) is the same.
Run a second command on your previous output:
Code: PowerShell [Select]
gc FILEPATH\out.txt|group|select Count,Name >FILEPATH\out-counted.txt
The code is also the pseudocode.
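
As a rough illustration of what the group step does: Group-Object groups whatever objects arrive in the pipeline, so to count extracted values they have to be piped in as strings (for example, the lines read by Get-Content). The sample values below are invented and stand in for the contents of out.txt.

```powershell
# Hypothetical values, as they might appear one per line in out.txt.
$values = 'Product1', 'Product1', 'Product2', 'Product1', 'Product3'

# group = Group-Object, select = Select-Object
$counted = $values | Group-Object | Select-Object Count, Name

$counted | Format-Table -AutoSize
```

Each output row carries the number of occurrences (Count) and the value itself (Name).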

kalos

  • Member
  • Joined in 2006
  • Posts: 1,672
Re: Extract REGEX matches from multiple text files
« Reply #104 on: September 18, 2018, 07:56 AM »
Quote from Ath: gc FILEPATH\out.txt|group|select Count,Name >FILEPATH\out-counted.txt

No, you misunderstood. I don't want to count matches. I want to group them and output each group to a separate file.

For example, I will search for my regex:
<html:producttype>(.+?)</html:producttype>
The possible matches will be:
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product2</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product3</html:producttype>
etc

I want the script to create one file with the matches where the (.+?) is the same, so:
1 file that contains:
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
<html:producttype>Product1</html:producttype>
1 file that contains:
<html:producttype>Product2</html:producttype>
and 1 file that contains:
<html:producttype>Product3</html:producttype>

Thanks!

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 4,954
Re: Extract REGEX matches from multiple text files
« Reply #105 on: September 18, 2018, 09:06 AM »
What's the difference?

You either have 3 lines that say:

3  Product1
1  Product2
1  Product3

Or three files that contain lines that say:

File "Product1.txt"
Product1
Product1
Product1

File "Product2.txt"
Product2

File "Product3.txt"
Product3

Either way all you're getting is a count of how many times a match appears.

kalos

  • Member
  • Joined in 2006
  • Posts: 1,672
Re: Extract REGEX matches from multiple text files
« Reply #106 on: September 18, 2018, 09:21 AM »
Quote from 4wd: Either way all you're getting is a count of how many times a match appears.

No, it's not the same, because the regex will be different! And I want to store the whole regex match in the file, which will be huge multiline text!

Ath

  • Supporting Member
  • Joined in 2006
  • Posts: 3,250
Re: Extract REGEX matches from multiple text files
« Reply #107 on: September 18, 2018, 12:43 PM »
Quote from kalos: because the regex will be different! And I want to store the whole regex match in the file, which will be huge multiline text!
Whut? :o

We're back at square one. The circle is completed, again.

You have come here asking for 'help'. Please tell us what you want to achieve, and stop asking for small bits of silly info, with even sillier examples. This is not going to get you to a solution, as you obviously don't understand how problem-solving works.
I'm pretty sure I've said this before, a couple of weeks ago. :(
If you can't comply with that, I'd suggest that all participants ignore your requests until something useful comes out of your keyboard.

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 4,954
Re: Extract REGEX matches from multiple text files
« Reply #108 on: September 18, 2018, 05:26 PM »
You expected anything more?

@kalos: Paste your complete PS script here as it is currently, not as a single line but as a correctly formatted script with no PS shortcuts.

FYI: 80% of what you seemingly want now is covered by this script: http://www.donationc....msg422784#msg422784

The only thing missing is output to separate files, which would be trivial to add ...
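
A minimal sketch of that "output to separate files" step, under stated assumptions: the sample text, the output folder ($outDir) and the one-file-per-value naming are all hypothetical, and the captured value is assumed to be usable as a file name. It is not the linked script, just the grouping idea on its own.

```powershell
# Hypothetical sample text; in practice this would come from Get-Content -Raw.
$text = @'
<html:productType>Product1</html:productType>
<html:productType>Product1</html:productType>
<html:productType>Product2</html:productType>
<html:productType>Product1</html:productType>
<html:productType>Product3</html:productType>
'@
$pattern = '<html:productType>(.+?)</html:productType>'

# Hypothetical output folder under the temp directory.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'grouped-matches'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

# Group the Match objects by the captured value (Groups[1]).
$groups = [regex]::Matches($text, $pattern) | Group-Object { $_.Groups[1].Value }

foreach ($g in $groups) {
  # Assumption: the captured value contains no characters that are invalid in file names.
  $file = Join-Path $outDir ($g.Name + '.txt')
  # Write the whole matched text (Groups[0]) of every match in this group.
  $g.Group | ForEach-Object { $_.Value } | Set-Content -Path $file
}
```

With this sample, Product1.txt would receive three lines and the other two files one line each. For matches that span several lines, the pattern would need the (?s) option so that . also matches newlines.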

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 4,954
Re: Extract REGEX matches from multiple text files
« Reply #109 on: September 23, 2018, 03:39 AM »
Input
File 1: xml-test.xml
<?xml version="1.0" encoding="ISO8859-1" ?>
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod2">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD2</html:classificationType>
          <html:productType>PRD_XE2</html:productType>
          <html:productId>10005</html:productId>
          <html:assignedDate>2018-12-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS2</html:name>
          <html:Entity>REP_XE2</html:legalEntity>
          <html:location>ED2</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod3">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2013-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS3</html:name>
          <html:Entity>REP_XE3</html:legalEntity>
          <html:location>ED3</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XE4</html:productType>
          <html:productId>10567</html:productId>
          <html:assignedDate>2010-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS4</html:name>
          <html:Entity>REP_XE4</html:legalEntity>
          <html:location>ED4</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2015-05-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS5</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
</html:products>

File 2: xml test2.xml
<?xml version="1.0" encoding="ISO8859-1" ?>
<html:products>
    <html:prod id="prod1">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD</html:classificationType>
          <html:productType>PRD_XE</html:productType>
          <html:productId>10004</html:productId>
          <html:assignedDate>2018-03-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REFUNDS</html:name>
          <html:Entity>REP_XE</html:legalEntity>
          <html:location>ED</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod2">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD2</html:classificationType>
          <html:productType>PRD_XE2</html:productType>
          <html:productId>10005</html:productId>
          <html:assignedDate>2015-12-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS2k12</html:name>
          <html:Entity>REP_XE2</html:legalEntity>
          <html:location>ED57</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod3">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XER3</html:productType>
          <html:productId>10014</html:productId>
          <html:assignedDate>2010-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>DESTRUCTION</html:name>
          <html:Entity>REP_XE3</html:legalEntity>
          <html:location>ED43</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod4">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD4</html:classificationType>
          <html:productType>PRD_XE4</html:productType>
          <html:productId>10567</html:productId>
          <html:assignedDate>1999-07-23</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>WHORU</html:name>
          <html:Entity>REP_XS4</html:legalEntity>
          <html:location>ED4</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
    <html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2115-12-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>SCREW_THIS</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
</html:products>


[Attached screenshots: 2018-09-22 12_44_50-XML Mulcher.png, 2018-09-22 12_46_07-K__.png]

Output
10004890.csv
Code: Text [Select]
prod5,PRD5,PRD_XE5,10004890,2115-12-15,SCREW_THIS,REP_XE5,ED5
prod5,PRD5,PRD_XE5,10004890,2015-05-15,REPAIRS5,REP_XE5,ED5

10004890.xml
Code: Text [Select]
<html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2115-12-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>SCREW_THIS</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>
<html:prod id="prod5">
      <html:referenceData>
        <html:product>
          <html:classificationType>PRD5</html:classificationType>
          <html:productType>PRD_XE5</html:productType>
          <html:productId>10004890</html:productId>
          <html:assignedDate>2015-05-15</html:assignedDate>
        </html:product>
        <html:book>
          <html:name>REPAIRS5</html:name>
          <html:Entity>REP_XE5</html:legalEntity>
          <html:location>ED5</html:location>
        </html:book>
      </html:referenceData>
   </html:prod>


Code: PowerShell [Select]
<#
.NAME
    XML-GUI.ps1
#>

Add-Type -AssemblyName System.Windows.Forms
[System.Windows.Forms.Application]::EnableVisualStyles()

#region begin GUI{

$Form                            = New-Object system.Windows.Forms.Form
$Form.ClientSize                 = '246,178'
$Form.text                       = "XML Mulcher"
$Form.BackColor                  = "#cccccc"
$Form.TopMost                    = $false
$Form.FormBorderStyle            = 'Fixed3D'
$Form.MaximizeBox                = $false

$TextBox1                        = New-Object system.Windows.Forms.TextBox
$TextBox1.Text                   = ""
$TextBox1.multiline              = $false
$TextBox1.ReadOnly               = $true
$TextBox1.Width                  = 185
$TextBox1.height                 = 20
$TextBox1.Location               = New-Object System.Drawing.Point(16,20)
$TextBox1.Font                   = 'Microsoft Sans Serif,10'

$ListBox1                        = New-Object system.Windows.Forms.ListBox
$ListBox1.text                   = ""
$ListBox1.width                  = 100
$ListBox1.height                 = 56
@('Classification','ProductType','ProductID') | ForEach-Object {[void] $ListBox1.Items.Add($_)}
$ListBox1.location               = New-Object System.Drawing.Point(16,50)

$Label1                          = New-Object system.Windows.Forms.Label
$Label1.Text                     = "Processing:"
$Label1.width                    = 68
$Label1.height                   = 16
$Label1.location                 = New-Object System.Drawing.Point(16,146)
$Label1.Font                     = 'Microsoft Sans Serif,8'

$TextBox2                        = New-Object system.Windows.Forms.TextBox
$TextBox2.multiline              = $false
$TextBox2.ReadOnly               = $true
$TextBox2.Width                  = 140
$TextBox2.height                 = 16
$TextBox2.Location               = New-Object System.Drawing.Point(88,144)
$TextBox2.Font                   = 'Microsoft Sans Serif,8'

$Button1                         = New-Object system.Windows.Forms.Button
$Button1.text                    = "Go"
$Button1.width                   = 60
$Button1.height                  = 30
$Button1.location                = New-Object System.Drawing.Point(171,65)
$Button1.Font                    = 'Microsoft Sans Serif,10'

$Button2                         = New-Object system.Windows.Forms.Button
$Button2.text                    = "..."
$Button2.width                   = 25
$Button2.height                  = 25
$Button2.location                = New-Object System.Drawing.Point(206,19)
$Button2.Font                    = 'Microsoft Sans Serif,10'

$Label2                          = New-Object system.Windows.Forms.Label
$Label2.Text                     = "Output:"
$Label2.width                    = 60
$Label2.height                   = 16
$Label2.location                 = New-Object System.Drawing.Point(16,120)
$Label2.Font                     = 'Microsoft Sans Serif,8'

$RadioButton1                    = New-Object system.Windows.Forms.RadioButton
$RadioButton1.text               = "XML"
$RadioButton1.AutoSize           = $true
$RadioButton1.width              = 40
$RadioButton1.height             = 16
$RadioButton1.location           = New-Object System.Drawing.Point(88,118)
$RadioButton1.Font               = 'Microsoft Sans Serif,8'

$RadioButton2                    = New-Object system.Windows.Forms.RadioButton
$RadioButton2.text               = "CSV"
$RadioButton2.Checked            = $true
$RadioButton2.AutoSize           = $true
$RadioButton2.width              = 40
$RadioButton2.height             = 16
$RadioButton2.location           = New-Object System.Drawing.Point(148,118)
$RadioButton2.Font               = 'Microsoft Sans Serif,8'

$Form.controls.AddRange(@($ListBox1,$TextBox1,$Button1,$Button2,$Label1,$TextBox2,$Label2,$RadioButton1,$RadioButton2))

#region gui events {
$Button1.Add_Click({
  if ($TextBox1.Text -ne "") {
    if ($ListBox1.SelectedItem -ne $null) {
      Clear-Host
      Set-Regex ($ListBox1.SelectedItem)
    }
  }
})

$Button2.Add_Click({
  $objForm = New-Object System.Windows.Forms.FolderBrowserDialog
  $objForm.Description = "Select folder containing XML"
  $objForm.SelectedPath = [System.Environment+SpecialFolder]'MyComputer'
  $objForm.ShowNewFolderButton = $false
  $result = $objForm.ShowDialog()
  if ($result -eq "OK") {
    $TextBox1.Text = $objForm.SelectedPath
  } else {
    $TextBox1.Text = ""
  }
})

#endregion events }
#endregion GUI }


#Write your logic code here
Function Set-Regex {
  param (
    [string]$selItem
  )
  switch ($selItem) {
    "Classification" { $regex = '(____________________________)(.+?)(___)' }
    "ProductType" { $regex = '(_____________________)(.+?)(___)' }
    "ProductID" { $regex = '(___________________)(.+?)(___)' }
  }
  Mulch-Files $regex
}

Function Mulch-Files {
  param (
    [string]$pattern
  )
  $files = Get-ChildItem -Path ($TextBox1.Text + "\*.xml")
  for ($h = 0; $h -lt $files.Count; $h++) {
    $TextBox2.Text = $files[$h].Name
    $products = (Get-Content $files[$h] -Raw) -_____ '(____)^.*?(____________________________)'
    for ($i = 1; $i -lt $products.Count; $i += 2) {
      $products[$i] -_____ '(_________)(.+?)(___)'
      $prod = $Matches[0]
      $temp = $products[$i] -_____ $pattern
      for ($j = 0; $j -lt $temp.Count; $j++) {
        if ($RadioButton2.Checked) {
          $outFile = $Matches[0] + ".csv"
          $outText = ($prod + (((($products[$i] -replace '(<[^>]+>|\s)', ',' ) -replace '`r', '') -replace '`n', '') -replace '(,)(,)+', '$1').TrimEnd(','))
        } else {
          $outFile = $Matches[0] + ".xml"
          $outText = $products[$i]
        }
        Out-File -FilePath $outFile -InputObject $outText -Append
      }
    }
  }
  $TextBox2.Text = "Finished"
}

[void]$Form.ShowDialog()
« Last Edit: September 23, 2018, 09:42 AM by 4wd »
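
For anyone trying to follow the CSV branch of the script, the flattening step can be tried in isolation. This is only a sketch of that idea with a shortened, made-up XML fragment: every tag and every whitespace character becomes a comma, runs of commas are collapsed to one, and stray commas at the ends are trimmed.

```powershell
# Hypothetical, shortened fragment in the same shape as the sample input.
$block = @'
<html:product>
  <html:classificationType>PRD5</html:classificationType>
  <html:productId>10004890</html:productId>
</html:product>
'@

# 1) tags and whitespace -> commas, 2) collapse comma runs, 3) trim the ends
$csv = (($block -replace '<[^>]+>|\s', ',') -replace ',{2,}', ',').Trim(',')

$csv   # PRD5,10004890
```

One caveat of this approach: because whitespace is turned into commas, a value that itself contains a space would be split into two CSV fields.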