Topic: Looking for software tools to extract specific text info from html pages

trose

Dear fellow DC members,

I am looking for software tools that would enable me to extract information from a large number of html pages. These html pages have already been downloaded to my computer's hard drive, so no web access is needed. For each file, I want to extract a person's name, email address, and phone number. Not all this information will exist on each html page. Each html page contains the information for one individual.

I currently extract the information manually by:
Loading an HTML page into NoteTab Standard
Stripping the HTML tags
Cutting and pasting the desired information into another text file
Repeating ad nauseam
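
For illustration only, the "stripping html tags" step above could be sketched in Python with nothing but the standard library. This is a minimal sketch, not a finished tool: the names `TextExtractor` and `strip_tags` are made up for this example, and it assumes the pages are ordinary HTML where anything inside script or style tags can be ignored.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_tags(page):
    """Hypothetical helper: return the page's text, one line per text run."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Jane Doe</h1><p>Phone: 555-0100</p></body></html>"
    print(strip_tags(sample))
```

The result is roughly what remains in the editor after the tags have been stripped by hand, so the later copy-and-paste step has plain text to work on.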

Currently there is a subdirectory of approximately 500 such HTML pages. I want to automate this process so that it can also be run on other data sources. I am NOT a programmer, so if someone suggests using a scripting language and regular expressions, please be prepared to walk me through the process. That said, I am not afraid of the command line. I would prefer to use one or more software tools, whether command-line or GUI-based.

If anyone can suggest either free or commercial (not too expensive!) software tools to accomplish this task, I will be grateful. I have been a member of Donation Coder for a while, and downloaded some tools (use FARR every day), but this is the first time I have asked for your help in solving a problem. I hope you guys and gals can come up with something. If I have posted this request in an inappropriate location, please let me know.

Thanks for your time.

Ted Rose
[EMAIL REMOVED TO AVOID SPAMBOTS FINDING IT -- SEND A MESSAGE ON THE FORUM TO TED TO CONTACT HIM DIRECTLY]
« Last Edit: February 15, 2010, 06:12 AM by mouser »

f0dder

Are the HTML files formatted/structured in a consistent way? If they are, it should be easy to write regular expressions to extract the information. If it's "just a bunch of pages" that have names, addresses and emails as part of other content, that will probably not work :)
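
To give an idea of what "consistent" buys you, here is a rough Python sketch. The markup is entirely hypothetical, since no sample file has been posted: it assumes each page puts the name in `<span class="name">`, the phone number in `<span class="phone">`, and the email in a `mailto:` link. The names `FIELD_PATTERNS` and `extract_fields` are invented for the example.

```python
import re

# Hypothetical markup: each field sits in a predictable tag, for example
# <span class="name">Jane Doe</span>. Real pages will almost certainly differ.
FIELD_PATTERNS = {
    "name": re.compile(r'<span class="name">\s*(.*?)\s*</span>', re.I | re.S),
    "email": re.compile(r'<a href="mailto:([^"]+)"', re.I),
    "phone": re.compile(r'<span class="phone">\s*(.*?)\s*</span>', re.I | re.S),
}


def extract_fields(page):
    """Return a dict of field -> value, with None for fields missing on the page."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page)
        result[field] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    sample = ('<div><span class="name"> Jane Doe </span>'
              '<a href="mailto:jane@example.com">mail</a></div>')
    print(extract_fields(sample))
```

A field that is missing from a page simply comes back as `None`, which covers the "not all this information will exist on each page" case. If the pages share no such structure, patterns like these have nothing reliable to anchor on.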

If the files are consistent, can you provide an anonymized version of one of the files? I.e., something that has all the header/footer fluff and general HTML structure (also anonymized, if need be), along with some anonymized name/address/email entries. If no fields are optional, a single entry would be enough; if some fields are optional, it would be good to have a few entries so one can see what a missing field looks like in the HTML.
- carpe noctem

widgewunner

f0dder is correct. This sounds like a job for Regex! If you can post a few examples of the html files (after changing the actual names/emails of course), I'm sure we can help you out.

I'm a regex addict and live for this sort of thing! (Sick, I know!) :)
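
Until sample files turn up, here is a rough sketch of how the whole job (a folder of about 500 saved pages in, one spreadsheet-friendly CSV file out) might look in Python. Everything specific in it is an assumption: the phone pattern only covers North American style numbers, the name is guessed from the page's `<title>`, and the function and file names (`extract_record`, `extract_folder`, `pages`, `contacts.csv`) are made up for the example.

```python
import csv
import html
import re
from pathlib import Path

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: numbers look like 555-123-4567 or (555) 123-4567.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
# Assumption: the person's name is the page title; adjust once samples exist.
TITLE_RE = re.compile(r"<title>\s*(.*?)\s*</title>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def first_match(pattern, text, group=0):
    """Return the first match of pattern in text, or "" if there is none."""
    match = pattern.search(text)
    return match.group(group) if match else ""


def extract_record(page):
    """Pull name, email and phone out of one page; missing fields are ""."""
    text = html.unescape(TAG_RE.sub(" ", page))  # crude tag stripping
    return {
        "name": html.unescape(first_match(TITLE_RE, page, 1)),
        "email": first_match(EMAIL_RE, page),  # raw page, so mailto: links count
        "phone": first_match(PHONE_RE, text),
    }


def extract_folder(folder, out_csv):
    """Write one CSV row per .html/.htm file in folder; return the row count."""
    files = sorted(p for p in Path(folder).iterdir()
                   if p.suffix.lower() in (".html", ".htm"))
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "name", "email", "phone"])
        writer.writeheader()
        for path in files:
            page = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": path.name, **extract_record(page)})
    return len(files)


if __name__ == "__main__":
    # Hypothetical folder and output names; change them to match your setup.
    print(extract_folder("pages", "contacts.csv"), "files processed")
```

Saved as a script and run from the command line in the folder above `pages`, this would leave a `contacts.csv` that opens in any spreadsheet program. The generic patterns will need tightening once the real page layout is known.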