Author Topic: Looking for software tools to extract specific text info from html pages (Read 3391 times)

trose · « **on:** February 15, 2010, 02:04 AM »

Dear fellow DC members,

I am looking for software tools that would enable me to extract information from a large number of html pages. These html pages have already been downloaded to my computer's hard drive, so no web access is needed. For each file, I want to extract a person's name, email address, and phone number. Not all this information will exist on each html page. Each html page contains the information for one individual.

I now extract information manually by:
Loading an html page into Note Tab Standard
Stripping html tags
Cutting and pasting (into another text file) the desired information
Repeating ad nauseam
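
For reference, the manual steps above (load a page, strip the tags, pick out the details) could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the names `TextExtractor`, `strip_tags` and `find_contact_details` are made up for illustration, the sample page is invented, and the email and phone patterns are deliberately loose guesses (the phone pattern assumes a 10-digit North American style number). Finding the person's name is left out because it depends entirely on how the pages are laid out.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags; the tags themselves are dropped.

    Note: the contents of <script> and <style> blocks are also passed to
    handle_data, so they end up in the text as well.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML string with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Assumed patterns, intentionally loose; real data will need tuning.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def find_contact_details(html):
    """Return the first email and phone number found, or None if missing."""
    text = strip_tags(html)
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


# Invented sample page, standing in for one of the downloaded files.
sample = (
    "<html><body><h1>Jane Doe</h1>"
    "<p>Email: <a href='mailto:jane@example.com'>jane@example.com</a></p>"
    "<p>Phone: (555) 123-4567</p></body></html>"
)
print(find_contact_details(sample))
```

A field that is not on the page simply comes back as `None`, which matches the situation where not every page has every piece of information.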

Currently there is a subdirectory of approximately 500 such html pages. I want to automate this process so it can also be run on other data sources. I am NOT a programmer, so if someone suggests using a scripting language and regular expressions, please be prepared to walk me through the process. I am not afraid of the command line, though. I would prefer to use one or more software tools, whether command line or GUI-based.
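
To give an idea of what automating a whole subdirectory might look like, here is a rough Python sketch that walks a folder of html files and writes one row per file to a CSV file that opens in a spreadsheet. Everything named here is an assumption for illustration: `process_folder` and `extract_email` are made-up names, only the email address is pulled out to keep the example short, and the files are assumed to be readable as UTF-8 (undecodable bytes are replaced rather than causing an error).

```python
import csv
import re
from pathlib import Path

# Assumed, deliberately loose email pattern; real data may need tuning.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_email(html):
    """Return the first email address found in the page, or '' if none."""
    match = EMAIL_RE.search(html)
    return match.group(0) if match else ""


def process_folder(folder, output_csv):
    """Write one CSV row per .htm/.html file in folder; return the file count."""
    count = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "email"])
        for path in sorted(Path(folder).glob("*.htm*")):
            html = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([path.name, extract_email(html)])
            count += 1
    return count


# Example call (hypothetical paths):
# process_folder("C:/pages", "C:/pages/contacts.csv")
```

The same loop could be pointed at any other folder, so it would carry over to other data sources as long as the extraction step is adjusted to suit them.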

If anyone can suggest either free or commercial (not too expensive!) software tools to accomplish this task, I will be grateful. I have been a member of Donation Coder for a while, and downloaded some tools (use FARR every day), but this is the first time I have asked for your help in solving a problem. I hope you guys and gals can come up with something. If I have posted this request in an inappropriate location, please let me know.

Thanks for your time.

Ted Rose
[EMAIL REMOVED TO AVOID SPAMBOTS FINDING IT -- SEND A MESSAGE ON THE FORUM TO TED TO CONTACT HIM DIRECTLY]

f0dder · « **Reply #1 on:** February 15, 2010, 03:38 AM »

Are the HTML files formatted/structured in a consistent way? If they are, it should be easy to write regular expressions to extract the information. If it's "just a bunch of pages" that have names, addresses and emails as part of other content, that will probably not work.

If the files are consistent, can you provide an anonymized version of one of the files? I.e., something that has all the header/footer fluff and general HTML structure (also anonymized, if need be), along with some anon name/addr/mail entries (if no fields are optional, a single entry would be enough - if some fields are optional, it would be good to have a few entries so one can see what a missing field results in wrt. the HTML).
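
To illustrate why consistency matters, here is a small Python sketch of regex extraction against a made-up page layout. The markup it expects (an `h1` with class `name`, a `mailto:` link, a `span` with class `phone`) is purely an assumption, since the real files have not been posted, and `FIELD_PATTERNS` and `extract_fields` are hypothetical names. The point is only that each field gets its own pattern, and a missing optional field yields an empty string instead of breaking the run.

```python
import re

# Hypothetical layout: each value sits in a tag that is easy to recognise.
# The real patterns would have to be written against the actual files.
FIELD_PATTERNS = {
    "name": re.compile(r'<h1 class="name">(.*?)</h1>', re.S),
    "email": re.compile(r'<a href="mailto:([^"]+)"'),
    "phone": re.compile(r'<span class="phone">(.*?)</span>', re.S),
}


def extract_fields(html):
    """Return a dict with one entry per field; '' when a field is absent."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else ""
    return record


# Invented page in the assumed layout, with the phone number left out.
page = (
    '<html><body><h1 class="name">Jane Doe</h1>'
    '<a href="mailto:jane@example.com">mail</a></body></html>'
)
print(extract_fields(page))
```

If the pages are not laid out consistently, patterns tied to the markup like these will miss entries, which is why a few anonymized samples would help.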

cmpm · « **Reply #2 on:** February 15, 2010, 08:06 AM »

http://www.listextractor.com/

http://www.worldminer.com/

A couple worth a look, I think,
from this search string:

http://www.google.co...amp;client=firefox-a

widgewunner · « **Reply #3 on:** February 15, 2010, 04:30 PM »

f0dder is correct. This sounds like a job for Regex! If you can post a few examples of the html files (after changing the actual names/emails of course), I'm sure we can help you out.

I'm a regex addict and live for this sort of thing! (sick I know!)
