
Does anyone know of a good parser for Wikipedia content?


normeus:
Does anyone know of a good parser for Wikipedia content? I don't want to write a full parser if I don't have to.
The parsing shouldn't be too bad in Perl (or any language with regexes), but I feel there should already be a program for this, since it sounds like something a lot of people would need to do.

Export a file from Wikipedia (I'll use a random animator's name, "Craig_Clark", as an example):
Wikipedia export

Then from the XML page I would only use the text part:
  <text xml:space="preserve" bytes="3618"> text text 3618 bytes of text </text>
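Pulling that <text> element out shouldn't be too hard. A rough Perl sketch of what I have in mind (it slurps the whole file and uses a naive regex instead of a real XML parser, so it assumes a small, single-page export; the filename is just my example):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Slurp the whole export file (filename is just my example).
  open my $fh, '<', 'Craig_Clark.xml' or die "open: $!";
  my $xml = do { local $/; <$fh> };
  close $fh;

  # Grab the body of the first <text> element.
  # Naive: assumes one page per export file.
  my ($body) = $xml =~ m{<text[^>]*>(.*?)</text>}s
      or die "no <text> element found\n";

  # Undo the XML escaping the export applies to the wikitext;
  # &amp; must come last so nothing gets double-unescaped.
  $body =~ s/&lt;/</g;
  $body =~ s/&gt;/>/g;
  $body =~ s/&quot;/"/g;
  $body =~ s/&amp;/&/g;

  print $body;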
What I still need to do is convert the wiki markup (e.g. [[animator]]) to regular text (e.g. animator).
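For that conversion, a first-pass sketch in the same vein (this only handles plain [[link]] and piped [[target|label]] links plus the ''italic''/'''bold''' quote markup; templates, tables, and references would still need real parsing, which is why I'd rather find an existing tool):

  # [[target|label]] becomes label; [[target]] becomes target.
  $body =~ s/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/$1/g;

  # Strip ''italic'' and '''bold''' quote markup.
  $body =~ s/'{2,}//g;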
Anyway, thanks for your comments.

Renegade:
Off-hand I don't know of any, but if you get desperate, you could get the MediaWiki source and look through its parser. It would be painful, but it would also be more reliable and easier than writing one yourself.

Renegade:
Here's a list:

http://www.mediawiki.org/wiki/Alternative_parsers

That might have what you're looking for.

normeus:
I found this program:
WP2TXT
It does most of what I need, but for some reason it does not work on some wiki dumps.
Anyway, for now it is the best I have found.

Thanks for the reply, Renegade.

normeus:
Is it just me being tired of looking at text? I typed Renegade, but it shows up as Reneqade.

Anyway, thanks for the reply, Renegade.
