NANY 2019: RegexCaptor - Simple app to extract email or other patterns from text

mouser:
Will do.  I'll try to get a new version up tomorrow or this weekend.

osensnolf:
Looking forward to it - very much.

mouser:
P.S. The problem is related to the number of results, not the size of the input file (tested with 12 GB files).

mouser:
So the problem is probably too many results: either running out of memory for the raw results, or for the display of results, or during the sorting/de-duping operation.
I have a few options, ranging from the most proper (a 64-bit build, which can use more memory) to the most kludgy (limit the number of results found on a given pass and make the user do multiple passes), with various things in between.  Thinking on it.
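
As an illustration of one of the "in between" options, here is a minimal Python sketch that de-dupes matches as they are found instead of collecting every raw hit first, with an optional cap per pass. This is not how RegexCaptor is implemented; the function name `unique_matches` and the `max_results` parameter are hypothetical.

```python
import re

def unique_matches(lines, pattern, max_results=None):
    """Yield each distinct match once, in first-seen order.

    Only the set of distinct hits is kept in memory, not every raw match.
    max_results (hypothetical) stops the pass after that many distinct hits.
    """
    seen = set()
    for line in lines:
        for m in pattern.finditer(line):
            hit = m.group(0)
            if hit in seen:
                continue  # duplicate: never stored a second time
            seen.add(hit)
            yield hit
            if max_results is not None and len(seen) >= max_results:
                return
```

Passing an open file object as `lines` reads it one line at a time, so memory grows with the number of distinct results rather than with the file size or the raw hit count.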

HOWEVER, another solution is suggested by the fact that the regular expression pattern I provide by default is quite imperfect and hits on many things that don't look like well-formed emails.  A better regular expression that produced fewer bad hits would also solve the problem.  For example, if we limited the results to those with the most common extensions (.com, .org, etc.), we could probably eliminate most of the bad results and solve the overflow of results at the same time.
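
A rough sketch of that idea in Python, assuming a short hypothetical list of "common" extensions (the list and the pattern are illustrative, not the app's default pattern):

```python
import re

# Hypothetical short list of common extensions; not taken from the app.
COMMON_TLDS = ("com", "org", "net", "edu", "gov")

# Loose email pattern that only accepts domains ending in a listed extension.
EMAIL_COMMON_TLD = re.compile(
    r"\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+(?:"
    + "|".join(COMMON_TLDS)
    + r")\b",
    re.IGNORECASE,
)

def extract_emails(text):
    """Return distinct lowercase matches in first-seen order."""
    seen = set()
    found = []
    for m in EMAIL_COMMON_TLD.finditer(text):
        addr = m.group(0).lower()
        if addr not in seen:
            seen.add(addr)
            found.append(addr)
    return found
```

The trade-off is that valid addresses on less common extensions are silently dropped, so the list would need to be adjustable.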

In fact, if extracting emails is the most common use for the app, I could hardcode a *proper* (non-regular-expression) email address validator and offer it as an option.  Then we could use a quick and dirty regular expression to find email address candidates, followed by a proper email address validation algorithm.  I'll probably add that just for fun.
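
The two-stage approach could look something like the following Python sketch. The rules in `is_plausible_email` are a simplified subset chosen for illustration (not a full RFC 5322 validator and not the app's code), and all names here are hypothetical.

```python
import re
import string

# Stage 1: quick and dirty candidate finder (anything non-space around an @).
CANDIDATE = re.compile(r"\S+@\S+")

LOCAL_CHARS = set(string.ascii_letters + string.digits + "._%+-")
LABEL_CHARS = set(string.ascii_letters + string.digits + "-")

def is_plausible_email(addr):
    """Stage 2: non-regex checks on a single candidate (simplified rules)."""
    if addr.count("@") != 1:
        return False
    local, domain = addr.split("@")
    if not 1 <= len(local) <= 64 or len(domain) > 253:
        return False
    if local[0] == "." or local[-1] == "." or ".." in local:
        return False
    if not set(local) <= LOCAL_CHARS:
        return False
    labels = domain.split(".")
    if len(labels) < 2:
        return False
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        if label[0] == "-" or label[-1] == "-":
            return False
        if not set(label) <= LABEL_CHARS:
            return False
    tld = labels[-1]
    return len(tld) >= 2 and tld.isalpha()

def find_emails(text):
    """Run the loose regex, trim surrounding punctuation, then validate."""
    results = []
    for m in CANDIDATE.finditer(text):
        cand = m.group(0).strip(".,;:!?()<>[]\"'")
        if is_plausible_email(cand):
            results.append(cand)
    return results
```

Keeping the candidate regex loose means the validator decides what counts as an email, so the rules can be tightened without touching the search pattern.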

mouser:
Working on it now.
