Author Topic: Parsing / Filtering text  (Read 7066 times)

RedPillow

Parsing / Filtering text
« on: February 25, 2010, 04:27 AM »
Yo again, another topic I need help with.

I have this big list in a txt-file which contains paths.

Example paths:

Motorcycle\crap\morecrap
Car\crap\morecrap
Truck\crap\morecrap
Bike\crap\morecrap

And now, I want to delete everything after the first "\" so the lines will look like this:

Motorcycle
Car
Truck
Bike

How do I do this?
What program to use?
Possibly a script?
It can't be done with Notepad's search & replace command like this:

Search: \*
Replace: -empty-

because Notepad searches for the literal text "\*" and not for "\ followed by anything".

Suggestions?

skwire

Re: Parsing / Filtering text
« Reply #1 on: February 25, 2010, 04:44 AM »
If you're familiar with RegEx at all, here is a way to do it in AutoHotkey.  You could easily adapt the RegEx portion to your own code or a more capable editor that has RegEx search capabilities.

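Code: AutoHotkey
; Sample input; in practice this would come from the text file.
Text =
(
Motorcycle\crap\morecrap
Car\crap\morecrap
Truck\crap\morecrap
Bike\crap\morecrap
)

; Walk the text line by line; the trailing `r strips any carriage returns.
Loop, Parse, Text, `n, `r
{
    If ( A_LoopField != "" )
    {
        ; Lazy match: capture as little as possible before the first backslash.
        RegExMatch( A_LoopField, "^(.+?)\\", SubPat )
        Block .= SubPat1 . "`r`n"
    }
}

; Show the collected first segments.
MsgBox, % Block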
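
For anyone adapting the RegEx portion to other code, here is a rough Python sketch of the same loop. It is only an illustration, not part of the AutoHotkey script: the names SAMPLE, FIRST_SEGMENT and first_segments are made up for this example, and keeping lines that contain no backslash unchanged is a choice made for this sketch.

```python
import re

SAMPLE = r"""Motorcycle\crap\morecrap
Car\crap\morecrap
Truck\crap\morecrap
Bike\crap\morecrap"""

# Lazy match: as few characters as possible from the start of the line,
# followed by the first backslash.
FIRST_SEGMENT = re.compile(r"^(.+?)\\")


def first_segments(text):
    """Return the part of each non-empty line that precedes the first backslash."""
    result = []
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines, as the AutoHotkey loop does
        match = FIRST_SEGMENT.match(line)
        # A line without a backslash is kept unchanged (a choice for this sketch).
        result.append(match.group(1) if match else line)
    return result


print("\n".join(first_segments(SAMPLE)))
```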

ewemoa

Re: Parsing / Filtering text
« Reply #2 on: February 25, 2010, 05:16 AM »
Here's one way to use Notepad++ for this task.

Edited the images -- should be easier for subsequent viewings :)

1. Open the file in Notepad++
2. Choose Search -> Replace
3. Ensure the cursor is at the beginning of the text and select "Regular expression" for Search Mode
4. Fill in appropriate values for "Find what" and "Replace with":
   Find what: ^([^\\]+)\\.*
   Replace with: \1
5. Click the "Replace All" button
6. Examine the results
« Last Edit: February 25, 2010, 08:59 PM by ewemoa »
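
The same find/replace pair can be tried outside the editor. Below is a minimal Python sketch, assuming the "Find what" value ^([^\\]+)\\.* and the "Replace with" value \1; the function name strip_after_first_backslash is hypothetical, and Python's regex flavor may differ from Notepad++'s in small details.

```python
import re


def strip_after_first_backslash(text):
    """Replace each line with the part that precedes its first backslash."""
    # re.MULTILINE lets ^ match at the start of every line, not only at the
    # start of the whole string.
    return re.sub(r"^([^\\]+)\\.*", r"\1", text, flags=re.MULTILINE)


paths = "Motorcycle\\crap\\morecrap\nCar\\crap\\morecrap\n"
print(strip_after_first_backslash(paths))
```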

RedPillow

Re: Parsing / Filtering text
« Reply #3 on: February 25, 2010, 05:39 AM »
Nice one ewemoa!!

I had to replace the ¥'s with \'s though :]

Can you break this ^([^\\]+)\\.* apart and explain what each thing there does and the \1 also?

skwire

Re: Parsing / Filtering text
« Reply #4 on: February 25, 2010, 05:53 AM »
Quote: "I had to replace the ¥'s with \'s though :]"

The yen symbols are actually backslashes on ewemoa's (and my) computer.  It's a side effect of using Japanese as the default language on an English Windows box.  I've become so accustomed to it over the years that my eyes don't even "see" them as yen symbols anymore.

ewemoa

Re: Parsing / Filtering text
« Reply #5 on: February 25, 2010, 06:30 AM »
Sorry, the environment I was using was non-English (ah, skwire has explained already) -- but you figured out the appropriate replacement :)

Copy-pasting much text from http://regularexpression.info/:

^([^\\]+)\\.*

Piece by piece, from left to right:

^ (at the very start)
Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character.

( ... )
Round brackets group the regex between them. They capture the text matched by the regex inside them so that it can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.

[ ... ]
The opening square bracket starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply.  Note: in this case, the closing square bracket ends the character class in question.

^ (right after the opening [)
Negates the character class, causing it to match a single character not listed in the character class. (A caret placed anywhere else inside the class is just a literal caret.)

\\ (both inside the character class and after the closing bracket)
A backslash escapes special characters to suppress their special meaning.  I wanted a literal backslash, but the backslash character itself has a special meaning in these contexts, so each one had to be "escaped" with another backslash.

+
Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.  "Previous item" here means the character class of non-backslash characters.

.
Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.

*
Repeats the previous item zero or more times. "Previous item" here refers to the dot.  Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.
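
To make the "greedy" part concrete, here is a small Python sketch comparing three ways of writing the first part of the pattern. It is only an illustration: the variable names are made up, and the lazy form .+? is the one from skwire's AutoHotkey reply.

```python
import re

line = r"Truck\crap\morecrap"

# Greedy dot: grabs as much as possible, so it only backs off to the LAST backslash.
greedy = re.match(r"^(.+)\\", line).group(1)

# Lazy dot: grabs as little as possible, so it stops before the FIRST backslash.
lazy = re.match(r"^(.+?)\\", line).group(1)

# Negated character class: greedy, but it cannot cross a backslash at all.
negated = re.match(r"^([^\\]+)\\", line).group(1)

print(greedy)   # Truck\crap
print(lazy)     # Truck
print(negated)  # Truck
```

So the negated character class and the lazy dot end up capturing the same text here, while a plain greedy dot would keep too much.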

Bringing the pieces together, one English translation might be:

Match a line which:

starts with a sequence of non-backslash characters (and, oh, let's hold on to this for later reference [1]),
continues with a backslash character,
and further continues with some text which we don't really care about

As for the replacement portion:

\1

Substituted with the text matched by the 1st pair of capturing parentheses (in general, \1 through \9 stand for the text matched by the 1st through 9th pairs).  In the case in question, there is only one pair of capturing parentheses and they captured a sequence of non-backslash characters at the beginning of a line.

This description is not as complete as it might be, but perhaps it will suffice.


[1] The held text is what the capturing parentheses matched; the \1 that refers back to it is called a "backreference".
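
The breakdown above can also be written down as a commented pattern. This is a minimal sketch using Python's verbose regex mode, which is an assumption of this example (the editor's search box has no such mode); PATTERN and keep_first_segment are hypothetical names.

```python
import re

PATTERN = re.compile(r"""
    ^           # start of the line
    (           # open capturing group 1
      [^\\]+    # one or more characters that are not a backslash
    )           # close capturing group 1
    \\          # one literal backslash
    .*          # the rest of the line, which gets discarded
""", re.VERBOSE | re.MULTILINE)


def keep_first_segment(text):
    """Apply the pattern and keep only what group 1 captured on each line."""
    # \1 in the replacement is the backreference to group 1.
    return PATTERN.sub(r"\1", text)


print(keep_first_segment("Bike\\crap\\morecrap"))
```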

ewemoa

Re: Parsing / Filtering text
« Reply #6 on: March 07, 2010, 09:23 PM »
Here are some samples of using "The Regex Coach" to study regular expressions:

1. Specify Regular Expression and Target String
2. Observe Tree Analysis of Regular Expression
3. Specify Replacement String
4. Step Through Regular Expression Evaluation