Topic: Looking for software tools to extract specific text info from html pages
trose
« on: February 15, 2010, 02:04:32 AM »

Dear fellow DC members,

I am looking for software tools that would enable me to extract information from a large number of html pages. These html pages have already been downloaded to my computer's hard drive, so no web access is needed. For each file, I want to extract a person's name, email address, and phone number. Not all this information will exist on each html page. Each html page contains the information for one individual.

I currently extract the information manually by:
Loading an html page into NoteTab Standard
Stripping the html tags
Cutting and pasting the desired information into another text file
Repeating ad nauseam

Currently there is a subdirectory of approximately 500 such html pages. I want to automate this process so it can also be run on other data sources. I am NOT a programmer, so if someone suggests using a scripting language and regular expressions, please be prepared to walk me through the process. I am not afraid of the command line, though. I would prefer to use one or more software tools, whether command line or GUI-based.

If anyone can suggest either free or commercial (not too expensive!) software tools to accomplish this task, I will be grateful. I have been a member of Donation Coder for a while, and downloaded some tools (use FARR every day), but this is the first time I have asked for your help in solving a problem. I hope you guys and gals can come up with something. If I have posted this request in an inappropriate location, please let me know.

Thanks for your time.

Ted Rose
[EMAIL REMOVED TO AVOID SPAMBOTS FINDING IT -- SEND A MESSAGE ON THE FORUM TO TED TO CONTACT HIM DIRECTLY]
« Last Edit: February 15, 2010, 06:12:03 AM by mouser »
f0dder
« Reply #1 on: February 15, 2010, 03:38:31 AM »

Are the HTML files formatted/structured in a consistent way? If they are, it should be easy to write regular expressions to extract the information. If it's "just a bunch of pages" that have names, addresses and emails as part of other content, that will probably not work :)

If the files are consistent, can you provide an anonymized version of one of the files? I.e., something that has all the header/footer fluff and general HTML structure (also anonymized, if need be), along with some anonymized name/address/email entries. If no fields are optional, a single entry would be enough; if some fields are optional, it would be good to have a few entries so one can see what a missing field looks like in the HTML.
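
To give a rough idea of what the regex route could look like, here is a minimal sketch in Python 3 using only the standard library. It strips the tags from one page and then searches the remaining text for an email address and a phone number. Everything in it is an assumption for illustration: the names `strip_tags` and `extract_contact` are made up, the phone pattern assumes US-style 10-digit numbers, and the name field is left out because where it sits depends entirely on the page structure, which has not been posted yet.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(page):
    """Return the text of an HTML page with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Assumed patterns: a simple email shape and a US-style 10-digit phone number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def extract_contact(page):
    """Return the first email and phone found in the page, or None if missing."""
    text = strip_tags(page)
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }
```

A missing field simply comes back as `None`, which covers the case where not every page has every piece of information. The patterns would almost certainly need adjusting once a real sample page is available.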

- carpe noctem
cmpm
« Reply #2 on: February 15, 2010, 08:06:51 AM »

http://www.listextractor.com/

http://www.worldminer.com/

A couple worth a look, I think,
from this search string:

http://www.google.com/sea...cial&client=firefox-a
widgewunner
« Reply #3 on: February 15, 2010, 04:30:54 PM »

f0dder is correct. This sounds like a job for Regex! If you can post a few examples of the html files (after changing the actual names/emails of course), I'm sure we can help you out.

I'm a regex addict and live for this sort of thing! (sick, I know!) :)
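
In the meantime, here is a hedged sketch of how the whole folder could be processed in one go with Python 3 (standard library only). It is not tailored to the real files, since none have been posted: the function name `scan_folder`, the output columns, the crude tag-stripping regex, and the email and phone patterns (US-style numbers) are all assumptions for illustration. It writes one CSV row per html file, leaving a field blank when nothing is found.

```python
import csv
import html
import re
from pathlib import Path

# Assumed patterns; adjust once the real page layout is known.
TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, fine for simple pages
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def first_match(pattern, text):
    """Return the first match of pattern in text, or an empty string."""
    match = pattern.search(text)
    return match.group(0) if match else ""


def scan_folder(folder, out_csv):
    """Extract email and phone from every .htm/.html file in folder into a CSV."""
    rows = []
    for path in sorted(Path(folder).glob("*.htm*")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        # Replace tags with spaces, then decode entities such as &amp;
        text = html.unescape(TAG_RE.sub(" ", raw))
        rows.append({
            "file": path.name,
            "email": first_match(EMAIL_RE, text),
            "phone": first_match(PHONE_RE, text),
        })
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "email", "phone"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling something like `scan_folder("pages", "contacts.csv")` (both names are placeholders) would produce a file that opens directly in a spreadsheet, so the results can be checked against a few pages by hand before trusting the rest.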