Topic: Does anyone know of a good parser for Wikipedia content?

normeus

  • Supporting Member
  • Joined in 2006
« on: April 02, 2011, 12:51:41 PM »
Does anyone know of a good parser for Wikipedia content? I don't want to write a full parser if I don't have to.
The parsing shouldn't be too bad in Perl (or any language with regex), but I feel there should already be a program for this, since it sounds like something a lot of people would want to do.

Export a file from Wikipedia (I'll use a random animator's name, "Craig_Clark", as an example):
wikipedia export

Then, from the XML page, I would only use the text part:
  <text xml:space="preserve" bytes="3618"> text text 3618 bytes of text </text>
What I need to do is convert the wiki text (e.g. [[animator]]) to regular text (e.g. animator).
Anyway, thanks for your comments.
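
For reference, the [[animator]] to animator conversion described above can be roughed out in a few lines. This is only a minimal sketch in Python, not a full parser: the sample XML, the function names, and the regexes are all made up for illustration, and it handles only simple [[link]] / [[target|label]] links and bold/italic quote marks. Templates, references, and tables are ignored, which is exactly where a real parser earns its keep.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the <text> element quoted above.
# A real export nests it inside <mediawiki>/<page>/<revision> and
# usually declares an XML namespace on the root element.
SAMPLE = """<mediawiki><page><revision>
<text xml:space="preserve">'''Craig Clark''' is an [[animator]] and [[film director|director]].</text>
</revision></page></mediawiki>"""

# [[target]] or [[target|label]]: keep the label if present, else the target.
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
# Runs of two or more apostrophes mark italic/bold in wiki text.
QUOTES = re.compile(r"'{2,}")


def extract_wikitext(xml_string):
    """Return the contents of the first <text> element, ignoring namespaces."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        # Tags look like "{namespace}text" when a namespace is declared.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            return elem.text or ""
    return ""


def strip_markup(wikitext):
    """Replace simple wiki links with their visible text and drop quote markup."""
    plain = LINK.sub(r"\1", wikitext)
    return QUOTES.sub("", plain)


if __name__ == "__main__":
    print(strip_markup(extract_wikitext(SAMPLE)))
```

Anything beyond plain links (nested templates, image links with several pipes, and so on) will slip through this untouched, so treat it as a starting point rather than a replacement for a proper parser.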

Renegade

  • Charter Member
  • Joined in 2005
« Reply #1 on: April 03, 2011, 02:03:36 AM »
Offhand, I don't know of any, but if you get desperate, you can get the source and look through it. It would be painful, but it would also be more reliable and easier than doing it yourself.
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

Renegade

« Reply #2 on: April 03, 2011, 02:04:34 AM »
Here's a list:

http://www.mediawiki.../Alternative_parsers

That might have what you're looking for.

normeus

« Reply #3 on: April 03, 2011, 10:11:21 AM »
I found this program:
WP2TXT
It does most of what I need, but for some reason it does not work on some wiki dumps.
Anyway, for now it is the best I have found.

Thanks for the reply, Renegade.

normeus

« Reply #4 on: April 03, 2011, 12:00:29 PM »
Is it just me, tired of looking at text? I typed Renegade and it shows up as Reneqade.

Anyway, thanks for the reply, Renegade.