Topic: NANY 2019: RegexCaptor - Simple app to extract email or other patterns from text

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,896
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
This is a very simple beta release of a program to extract email addresses or other regular expressions from text files.

Screenshot - 5_4_2018 , 6_30_19 PM.png
Screenshot - 5_4_2018 , 6_37_13 PM.png



Motivation:

I needed to extract email addresses from bounced emails in order to remove them from the DonationCoder mailing lists.  This is a fairly simple task for a command-line regular expression extractor, but I like to be able to drag and drop and get some visual interaction.

I tried a few "free" tools for doing this and they were ALL adware, shareware, or feature-limited.  Just horrible.  I don't know when we got to a point where people think they can list software unambiguously as "free" and have it be filled with adware or be horribly crippled until you buy the full version.  :down:

So I decided to write my own tool, with hopes for improving it.  The goals are similar to CodeByters Linebyter which I have used in the past but whose source code was lost.

Again, this is a very simple tool, but it has a few minor features that make it useful for specific tasks:
  • You can create your own list of common regular expression search patterns and select between them easily.
  • You can specify a portion of the expression that should be extracted and listed.
  • You can specify additional patterns to be ignored (in regex or plaintext format).
  • The final list is sorted and duplicates are removed.
  • Easy to search multiple files; remembers file list.
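
For anyone who would rather see the idea behind the features above in script form, here is a minimal Python sketch. It is not RegexCaptor's actual code: the function name `extract`, its parameters, and the sample pattern are all made up for illustration.

```python
import re

def extract(text, pattern, group=0, ignore=()):
    """Return the sorted, de-duplicated captures of `pattern` found in `text`.

    group  -- which capture group to list (0 means the whole match)
    ignore -- regex patterns; any captured value matching one is skipped
    """
    ignore_res = [re.compile(p) for p in ignore]
    found = set()  # a set drops duplicates as results are collected
    for match in re.finditer(pattern, text):
        value = match.group(group)
        if any(r.search(value) for r in ignore_res):
            continue
        found.add(value)
    return sorted(found)

# Hypothetical sample input resembling headers from bounced emails.
sample = (
    "To: <bob@example.com>\n"
    "To: <alice@example.org>\n"
    "To: <bob@example.com>\n"
    "To: <postmaster@example.net>"
)
# Group 1 is the part between the angle brackets; only that part is listed.
pattern = r"<([\w.+-]+@[\w.-]+\.\w+)>"
print(extract(sample, pattern, group=1, ignore=[r"^postmaster@"]))
```

Using a capture group is what lets the search pattern include surrounding context (the angle brackets here) without that context ending up in the result list.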

Again, this is a very niche tool, but I may add features to it to make it more useful for other tasks.  If you already have a good regular expression "extractor" that you are happy with, this is unlikely to replace it.
« Last Edit: December 02, 2018, 08:56 PM by mouser »

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,540
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
Oooh! That's rather nifty. Could come in rather handy. Thank you.

IainB

Works rather well, and has helpful suggestions/favourites, etc.    :Thmbsup:

By the way:
  • it shows up in DcUpdater as: v1.01.9, with no web version (yet).
  • the executable file is shown as: v1.1.2.0
  • the GUI "About" says it is: v1.01.02
  • the Help file (Overview) says: v1.01.01 - Apr 23, 2013 <-- !
  • the Help file (Version History) says: v1.01.01 - Apr 23, 2018
« Last Edit: May 04, 2018, 09:47 PM by IainB »

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,610
    • View Profile
    • Donate to Member
Sounds like a useful tool, much more convenient than extracting stuff using Notepad++ :Thmbsup:

Quote: "The goals are similar to CodeByters Linebyter which I have used in the past but whose source code was lost."
That link isn't accessible for me, but that could be just me...


wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 11,186
    • View Profile
    • Donate to Member
Nope.  Not just you.

mouser

I have corrected the link to the old Linebyter program.

mouser

As of now, Regex Captor doesn't provide much if anything that other tools don't provide.

But I am open to feature requests if there are problems it might solve that other tools don't -- it would be fun if we could figure out some features that made the tool genuinely useful over other tools.
« Last Edit: May 05, 2018, 11:17 AM by mouser »

Ath

A useful feature could be to apply some automatic formatting to the results list, maybe similar to regex replace, where you can reuse the found result to create completely new text:
Template:
Code:
  <mailout name="\1" type="mailto">mailto://\1</mailout>
Result list:
Code:
  <mailout name="[email protected]" type="mailto">mailto://[email protected]</mailout>
  <mailout name="[email protected]" type="mailto">mailto://[email protected]</mailout>
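
To illustrate the template idea outside of any particular tool, here is a small Python sketch. It assumes the template uses `\1`-style backreferences as in the example above; the function name `format_results` and the sample addresses are invented for illustration and say nothing about how RegexCaptor might implement this.

```python
import re

def format_results(text, pattern, template):
    """Expand `template` once per match, filling in \\1, \\2, ... from the match."""
    # Match.expand() substitutes backreferences the same way re.sub() does.
    return [m.expand(template) for m in re.finditer(pattern, text)]

template = r'<mailout name="\1" type="mailto">mailto://\1</mailout>'
pattern = r"([\w.+-]+@[\w.-]+\.\w+)"
text = "alice@example.org, bob@example.com"  # hypothetical input

for line in format_results(text, pattern, template):
    print(line)
```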

(OT: The code=xml or code=html forum tags don't seem to work as intended, reverting to 'text' for now)

mouser

I think that's a great idea, I will add it.

BGM

  • Honorary Member
  • Joined in 2008
  • **
  • Posts: 562
    • View Profile
    • bgmCoder DC
    • Read more about this member.
    • Donate to Member
I always wanted a regex tester that had syntax colouring.  It hurts my brain to try and figure out all the patterns.  I've thought of even writing my own.

wraith808

Quote: "I always wanted a regex tester that had syntax colouring.  It hurts my brain to try and figure out all the patterns.  I've thought of even writing my own."

http://www.regexbuddy.com/

It's saved my sanity.

Ath

Or if you want a free (but online) solution: https://regex101.com

wraith808

Quote: "Or if you want a free (but online) solution: https://regex101.com"

The thing that really helped me that was missing in those online solutions was the ability to build regexes from parts and to see them broken down.  I use some pretty complicated regexes, and though they can be parsed in those online tools, debugging them was still a pain.

Contro

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 3,940
    • View Profile
    • Donate to Member
 :-*
Nice work.  :-* :P

One of the links fails (https://www.donationcoder.com/404) or is in the wrong position.
There is a problem with these menu options:
Help - Visit Program Homepage
Help - Visit Program Forum



osensnolf

  • Supporting Member
  • Joined in 2019
  • **
  • Posts: 6
    • View Profile
    • Donate to Member
I joined just to reply - I ran the program to extract emails from a 2.6GB file but I keep getting this error.

External exception EEFFACE

This happens less than 10% through my search.

Another option to add - allow me to turn off Results Preview.  For a large result, that consumes a lot of time when really all I want is the exported results.

How much $$ to get this resolved?

I downloaded the version in the original link.  I am a Windows 7 user - is there a better option I should use if my only goal is to extract emails from multiple large files that I have?

Thanks
« Last Edit: January 17, 2019, 11:33 AM by osensnolf »

mouser

Let me give it an update and see if I can reproduce the problem with a huge file.
An option to disable the preview is a good idea.

Let me ask, can you see about how many results it found when it hit that error? It might give me a clue.

osensnolf

The file is a 2.6GB txt file containing 119m rows of data.

It stops at: "Found 7968374 results. Scanning line 7970000"

On row 7969999 (1 before), I see the following email.
[email protected]

I removed that value and saved and ran the test again but still it stopped at the same place like clockwork.

What's odd is that I downloaded another email extractor (much slower) and it too locks up around the same time but I do not know which row it is on.  Storage and memory are not an issue.

I will try now with a larger and then a smaller file.

UPDATE: with a 3.5GB file with 27m rows, same error:
"Found 9792207 results scanning line 11800000"
« Last Edit: January 17, 2019, 01:20 PM by osensnolf »

mouser

I noticed when I last used it that it also seemed to be reporting an unusually high number of results, so it seems it's time for me to release an update.  I'll try to get one done in the next few days.

No excuse for it throwing that exception -- it sounds like I will be able to reproduce and solve it if I just use a big enough input file.  Stand by.

ps. I think the default email regular expression I have in the program is not great -- it might be nice to find a better one.

osensnolf

If you need me to test it, you can send it to me on PM.  I will pay once it is confirmed to work as I would like to have it ASAP.

Other than knowing what they are, I do not know anything about regular expressions, so I'll avoid pasting something that I find online, as I'm sure you will be able to find a better solution faster than me.

But yes, disabling the preview is a must.  Even with a smaller file more time is spent populating the preview than actually getting the results.

Thank you!

mouser

Will do.  I'll try to get a new version up tomorrow or this weekend.

osensnolf

Looking forward to it - very much.

mouser

P.S. The problem is related to the number of results, not the size of the input file (tested with 12 GB files).

mouser

So probably the problem is too many results -- either running out of memory for the raw results, or the display of results, or the sorting/de-duping operation.
I have a few options -- ranging from the most proper (a 64-bit build, which can use more memory) to the most kludgy (limit the number of results found on a given pass and make the user do multiple passes), to various things in between.  Thinking on it.

HOWEVER, another solution is suggested by the fact that the regular expression pattern I provide by default is so imperfect and hits on many things that don't look like well-formed emails.  A better regular expression that had fewer bad hits would also solve the problem.  So, for example, if we limited the results to those with the most common extensions (.com, .org, etc.), you could probably eliminate most of the bad results and solve the problem of an overflow of results at the same time.
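
A rough Python sketch of the "common extensions only" idea follows. The extension list, the regular expression, and the function name are all assumptions chosen for illustration, not the pattern the program actually ships with.

```python
import re

# Hypothetical whitelist; a real list would need to be much longer.
COMMON_TLDS = ("com", "org", "net", "edu", "gov")

# Loose email pattern; group 1 captures the final extension (TLD).
EMAIL = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+([A-Za-z]{2,})\b")

def emails_with_common_tld(text, tlds=COMMON_TLDS):
    """Return sorted unique matches whose extension is in `tlds`."""
    return sorted({
        m.group(0).lower()
        for m in EMAIL.finditer(text)
        if m.group(1).lower() in tlds
    })

# "img@2x.png" looks like an email to the loose pattern but is filtered out.
print(emails_with_common_tld("bob@example.com img@2x.png x@host.org"))
```

The obvious trade-off is that valid addresses on less common extensions are dropped too, so this is probably best offered as an option rather than as the default.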

In fact, if extracting emails is the most common use for the app, I could actually hardcode some *proper* (non regular expression) email address validator code that could be an option, which would do that.  Then we could use a quick and dirty regular expression to find email address candidates, followed by a proper email address validator algorithm. I'll probably add that just for fun.
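
As a sketch of that two-stage approach in Python: a deliberately sloppy regular expression finds candidates, and a plain (non-regex) function then checks each one. The rules in `is_plausible_email` are a simplified assumption loosely based on the usual length limits (64 characters for the local part, 63 per domain label, 254 overall) and are not a full standards-compliant validator; all names here are hypothetical.

```python
import re
import string

CANDIDATE = re.compile(r"\S+@\S+")  # quick and dirty: anything around an "@"

LOCAL_CHARS = set(string.ascii_letters + string.digits + "!#$%&'*+/=?^_`{|}~.-")
LABEL_CHARS = set(string.ascii_letters + string.digits + "-")

def is_plausible_email(s):
    """Simplified, non-regex plausibility check for an email address."""
    if s.count("@") != 1 or len(s) > 254:
        return False
    local, domain = s.split("@")
    if not local or len(local) > 64 or not set(local) <= LOCAL_CHARS:
        return False
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    labels = domain.split(".")
    if len(labels) < 2:
        return False
    for label in labels:
        if not label or len(label) > 63 or not set(label) <= LABEL_CHARS:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
    return labels[-1].isalpha()  # require a purely alphabetic final extension

def extract_emails(text):
    """Stage 1: find loose candidates. Stage 2: keep only plausible ones."""
    results = set()
    for m in CANDIDATE.finditer(text):
        candidate = m.group(0).strip("<>()[]{},;:\"'.")  # trim wrapping punctuation
        if is_plausible_email(candidate):
            results.add(candidate.lower())
    return sorted(results)

text = "Contact <bob@example.com>, or a..b@example.com; see img@2x and alice@mail.example.org."
print(extract_emails(text))
```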

mouser

Working on it now.