Author Topic: NANY 2019: RegexCaptor - Simple app to extract email or other patterns from text (Read 60396 times)

mouser · « **on:** May 04, 2018, 06:36 PM »

This is a very simple beta release of a program to extract email addresses or other regular expressions from text files.

Motivation:

I needed to extract email addresses from bounced emails in order to remove them from the donationcoder mailing lists. This is a fairly simple task for a command-line regular expression extractor tool, but I like to be able to drag and drop and get some visual interaction.

I tried a few "free" tools for doing this and they were ALL adware, shareware, or feature-limited. Just horrible. I don't know when we got to a point where people think they can list software unambiguously as "free" and have it be filled with adware or be horribly crippled until you buy the full version.

So I decided to write my own tool, with hopes for improving it. The goals are similar to CodeByters Linebyter which I have used in the past but whose source code was lost.

Again, this is a very simple tool, but it has a few minor features that make it useful for specific tasks:

You can create your own list of common regular expression search patterns and select between them easily.
You can specify a portion of the expression that should be extracted and listed.
You can specify additional patterns to be ignored (in regex or plaintext format).
The final list is sorted and duplicates removed.
Easy to search multiple files; remembers file list.
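
As a rough illustration of the features listed above (extracting one portion of the pattern, skipping ignored patterns, and sorting with duplicates removed), here is a minimal Python sketch. It is not RegexCaptor's actual code; the function name `extract`, its parameters, and the sample pattern are assumptions made for this example.

```python
import re

def extract(text, pattern, group=0, ignore=()):
    """Return sorted, de-duplicated captures of `pattern` found in `text`.

    `group` selects which portion of each match is kept (0 = whole match).
    `ignore` is a list of regex patterns; a capture matching any is skipped.
    (Hypothetical helper for illustration, not the program's real API.)
    """
    compiled = re.compile(pattern)
    ignore_res = [re.compile(p) for p in ignore]
    found = set()  # a set removes duplicates as we go
    for match in compiled.finditer(text):
        value = match.group(group)
        if any(r.search(value) for r in ignore_res):
            continue
        found.add(value)
    return sorted(found)

sample = (
    "To: <bob@example.com>\n"
    "Cc: <alice@example.org>, <bob@example.com>\n"
    "From: <mailer-daemon@example.net>"
)
# Keep only the part inside the angle brackets (group 1).
result = extract(
    sample,
    r"<([^<>\s@]+@[^<>\s@]+)>",
    group=1,
    ignore=[r"^mailer-daemon@"],
)
print(result)  # ['alice@example.org', 'bob@example.com']
```

A plaintext ignore entry could be handled the same way by passing it through `re.escape` first.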

Again, this is a very niche tool, but I may add features to it to make it more useful for other tasks. If you already have a good regular expression "extractor" that you are happy with, this is unlikely to replace it.

IainB · « **Reply #1 on:** May 04, 2018, 08:35 PM »

Oooh! That's rather nifty. Could come in rather handy. Thank you.

IainB · « **Reply #2 on:** May 04, 2018, 09:17 PM »

Works rather well, and has helpful suggestions/favourites, etc.

By the way:

it shows up in DcUpdater as: v1.01.9, with no web version (yet).
the executable file is shown as: v1.1.2.0
the GUI "About" says it is: v1.01.02
the Help file (Overview) says: v1.01.01 - Apr 23, 2013 <-- !
the Help file (Version History) says: v1.01.01 - Apr 23, 2018

Ath · « **Reply #3 on:** May 05, 2018, 05:28 AM »

Sounds like a useful tool, much more convenient than extracting stuff using Notepad++.

The goals are similar to CodeByters Linebyter which I have used in the past but whose source code was lost.
-mouser (May 04, 2018, 06:36 PM)

That link isn't accessible for me, but that can be just me...

wraith808 · « **Reply #4 on:** May 05, 2018, 09:32 AM »

Nope. Not just you.

mouser · « **Reply #5 on:** May 05, 2018, 09:33 AM »

I have corrected the link to the old Linebyter program.

mouser · « **Reply #6 on:** May 05, 2018, 10:41 AM »

As of now, Regex Captor doesn't provide much, if anything, that other tools don't provide.

But I am open to feature requests if there are problems it might solve that other tools don't -- it would be fun if we could figure out some features that made the tool genuinely useful over other tools.

Ath · « **Reply #7 on:** May 06, 2018, 04:04 AM »

A useful feature could be to apply some automatic formatting to the results list, maybe similar to regex replace, where you can re-use the found result to create complete new texts:
Template:

<mailout name="\1" type="mailto">mailto://\1</mailout>

Resultlist:

<mailout name="[email protected]" type="mailto">mailto://[email protected]</mailout>
<mailout name="[email protected]" type="mailto">mailto://[email protected]</mailout>
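
The template idea above can be sketched in Python, where `Match.expand` substitutes `\1`-style backreferences into a template string. This is only an illustration of the suggestion, not how RegexCaptor implements it; the function name `format_results`, the sample addresses, and the simple email pattern are assumptions.

```python
import re

def format_results(text, pattern, template):
    """Expand `template` once per unique match of `pattern` in `text`.

    `template` may use backreferences such as \\1 to re-use captured groups.
    (Hypothetical helper for illustration only.)
    """
    seen = set()
    lines = []
    for match in re.finditer(pattern, text):
        if match.group(0) in seen:
            continue  # skip duplicates, keep first-seen order
        seen.add(match.group(0))
        lines.append(match.expand(template))
    return lines

template = r'<mailout name="\1" type="mailto">mailto://\1</mailout>'
text = "Contact alice@example.org or bob@example.com."
# The whole address is wrapped in parentheses so it is available as \1.
formatted = format_results(text, r"([\w.+-]+@[\w.-]+\.\w+)", template)
for line in formatted:
    print(line)
```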

(OT: The code=xml or code=html forum tags don't seem to work as intended, reverting to 'text' for now)

mouser · « **Reply #8 on:** May 06, 2018, 05:15 AM »

I think that's a great idea, I will add it.

wraith808 · « **Reply #9 on:** May 09, 2018, 11:50 AM »

RegexCaptor featured on GHacks.

https://www.ghacks.n...om-files-on-windows/

BGM · « **Reply #10 on:** January 04, 2019, 02:16 PM »

I always wanted a regex tester that had syntax colouring. It hurts my brain to try and figure out all the patterns. I've thought of even writing my own.

wraith808 · « **Reply #11 on:** January 04, 2019, 10:04 PM »

I always wanted a regex tester that had syntax colouring. It hurts my brain to try and figure out all the patterns. I've thought of even writing my own.
-BGM (January 04, 2019, 02:16 PM)

http://www.regexbuddy.com/

It's saved my sanity.

Ath · « **Reply #12 on:** January 05, 2019, 03:42 AM »

Or if you want a free (but online) solution: https://regex101.com

wraith808 · « **Reply #13 on:** January 05, 2019, 10:09 AM »

Or if you want a free (but online) solution: https://regex101.com
-Ath (January 05, 2019, 03:42 AM)

The thing that really helped me that was missing in those online solutions was the ability to build regexes with parts and to see them broken down. I use some pretty complicated regexes, and though they can be parsed in those online tools, debugging them was still a pain.

Contro · « **Reply #14 on:** January 06, 2019, 02:19 PM »

Nice work.

One of the links fails (https://www.donationcoder.com/404) or is in the wrong position.
There is a problem with these menu options:
Help-Visit Program Homepage
Help-Visit Program Forum

osensnolf · « **Reply #15 on:** January 17, 2019, 11:14 AM »

I joined just to reply - I ran the program to extract emails from a 2.6GB file but I keep getting this error.

External exception EEFFACE

This happens less than 10% through my search.

Another option to add - allow me to turn off Results Preview. For a large result, that consumes a lot of time when really all I want is the exported results.

How much $$ to get this resolved?

I downloaded the version in the original link. I am a Windows 7 user - is there a better option I should use if my only goal is to extract emails from multiple large files that I have?

Thanks

mouser · « **Reply #16 on:** January 17, 2019, 12:17 PM »

Let me give it an update and see if I can reproduce the problem with a huge file.
An option to disable the preview is a good idea.

Let me ask, can you see about how many results it found when it hit that error? It might give me a clue.

osensnolf · « **Reply #17 on:** January 17, 2019, 12:47 PM »

The file contains 119m rows of data at 2.6GB txt

It stops at: Found 7968374 results. Scanning line 7970000

On row 7969999 (1 before), I see the following email.
[email protected]

I removed that value and saved and ran the test again but still it stopped at the same place like clockwork.

What's odd is that I downloaded another email extractor (much slower) and it too locks up around the same time but I do not know which row it is on. Storage and memory are not an issue.

I will try now with a larger and then a smaller file.

UPDATE with 3.5GB File with 27m rows, same error.
Found 9792207 results scanning line 11800000

mouser · « **Reply #18 on:** January 17, 2019, 03:28 PM »

I noticed when I last used it that it also seemed to be reporting an unusually high number of results, so it seems it's time for me to release an update. I'll try to get one done in the next few days.

No excuse for it throwing that exception -- it sounds like I will be able to reproduce and solve it if I just use a big enough input file. Stand by.

ps. I think the default email regular expression I have in the program is not great -- it might be nice to find a better one.

osensnolf · « **Reply #19 on:** January 17, 2019, 03:33 PM »

If you need me to test it, you can send it to me on PM. I will pay once it is confirmed to work as I would like to have it ASAP.

Other than knowing what they are, I do not know anything about regular expressions, so I'll avoid pasting something that I find online, as I'm sure you will be able to find a better solution faster than me.

But yes, disabling the preview is a must. Even with a smaller file more time is spent populating the preview than actually getting the results.

Thank you!

mouser · « **Reply #20 on:** January 17, 2019, 03:51 PM »

Will do. I'll try to get a new version up tomorrow or this weekend.

osensnolf · « **Reply #21 on:** January 19, 2019, 10:49 PM »

Looking forward to it - very much.

mouser · « **Reply #22 on:** January 20, 2019, 01:34 AM »

P.S. The problem is related to the number of results, not the size of the input file (tested with 12 GB files).

mouser · « **Reply #23 on:** January 20, 2019, 08:22 AM »

So probably the problem is too many results -- either running out of memory for the raw results, or the display of results, or the sorting/de-duping operation.
I have a few options -- ranging from the most proper (64bit build which can use more memory) to the most kludgey (limit number of results found on a given pass and make user do multiple passes), to various things in between. Thinking on it.
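
To make the "kludgey" end of that range concrete, here is a minimal Python sketch of capping the number of results handled per pass, so that only one batch is ever sorted and de-duplicated in memory at a time. It assumes the failure really is caused by holding too many results at once, which the thread has not confirmed; the names `iter_matches` and `batches`, the batch limit, and the sample pattern are all hypothetical.

```python
import re
from itertools import islice

def iter_matches(lines, pattern):
    """Lazily yield every match of `pattern`, one input line at a time."""
    compiled = re.compile(pattern)
    for line in lines:
        for match in compiled.finditer(line):
            yield match.group(0)

def batches(matches, limit):
    """Yield sorted, de-duplicated batches built from at most `limit` raw matches."""
    it = iter(matches)
    while True:
        batch = set(islice(it, limit))  # pull up to `limit` matches, drop repeats
        if not batch:
            return
        yield sorted(batch)

# Tiny in-memory stand-in for a huge file; a real run would iterate over an
# open file object instead, which also reads line by line.
lines = ["a@x.com b@y.org", "a@x.com", "c@z.net"]
passes = list(batches(iter_matches(lines, r"[\w.+-]+@[\w.-]+\.\w+"), limit=2))
print(passes)  # [['a@x.com', 'b@y.org'], ['a@x.com', 'c@z.net']]
```

Duplicates are only removed within each batch here, so a final merge step would still be needed to de-duplicate across passes.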

HOWEVER, another solution is suggested by the fact that the regular expression pattern I provide by default is so imperfect and hits on many things that don't look like well-formed emails. A better regular expression with fewer bad hits would also solve the problem. So, for example, if we limited the results to those with the most common extensions (.com, .org, etc.), we could probably eliminate most of the bad results and solve the problem of an overflow of results at the same time.

In fact, if extracting emails is the most common use for the app, I could actually hardcode some *proper* (non regular expression) email address validator code that could be an option, which would do that. Then we could use a quick and dirty regular expression to find email address candidates, followed by a proper email address validator algorithm. I'll probably add that just for fun.
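
The two-stage idea described above (a quick and dirty regular expression to find candidates, followed by a stricter non-regex check) might look roughly like the Python sketch below. It is an illustration only: the rules are deliberately simplified and are not a full RFC 5322 validator, and the names `CANDIDATE`, `COMMON_TLDS`, `is_plausible_email`, and `find_emails`, as well as the short extension list, are assumptions.

```python
import re

CANDIDATE = re.compile(r"\S+@\S+")  # stage 1: loose candidate finder
COMMON_TLDS = {"com", "org", "net", "edu", "gov"}  # assumed short list
TRIM = "<>()[]{}.,;:'\""  # punctuation often stuck to an address in text
LOCAL_OK = re.compile(r"[A-Za-z0-9._%+-]+")
LABEL_OK = re.compile(r"[A-Za-z0-9-]+")

def is_plausible_email(candidate, allowed_tlds=COMMON_TLDS):
    """Stage 2: simplified structural checks on one candidate string."""
    if candidate.count("@") != 1:
        return False
    local, domain = candidate.split("@")
    if not LOCAL_OK.fullmatch(local) or len(local) > 64:
        return False
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    labels = domain.split(".")
    if len(labels) < 2 or len(domain) > 255:
        return False
    for label in labels:
        if not LABEL_OK.fullmatch(label):
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
    return labels[-1].lower() in allowed_tlds

def find_emails(text):
    """Run both stages and return sorted, de-duplicated addresses."""
    candidates = (m.group(0).strip(TRIM) for m in CANDIDATE.finditer(text))
    return sorted({c for c in candidates if is_plausible_email(c)})

text = "Bounce for <bob@example.com>, also see foo@bar, x@y.invalid and image@2x.png."
emails = find_emails(text)
print(emails)  # ['bob@example.com']
```

Restricting the final label to a list of common extensions trades completeness for fewer bad hits, which is the trade-off discussed in the post.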

mouser · « **Reply #24 on:** January 20, 2019, 02:27 PM »

Working on it now.