Author Topic: Does anyone know of a good parser for wikipedia content (Read 3225 times)

normeus · « **on:** April 02, 2011, 12:51 PM »

Does anyone know of a good parser for Wikipedia content? I don't want to write a full parser if I don't have to.
The parsing shouldn't be too bad in Perl (or any language with regex support), but I feel there should already be a program for this, since it sounds like something a lot of people would want to do.

First, export a page from Wikipedia (I'll use a random animator's name, "Craig_Clark", as an example):
Wikipedia export

Then, from the XML page, I would only use the text element:
<text xml:space="preserve" bytes="3618"> text text 3618 bytes of text </text>
What I need to do is convert the wiki markup (e.g. [[animator]]) to plain text (e.g. animator).
Anyway, thanks for your comments.
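
For the narrow case described above (pull the text element out of the export, then turn [[animator]] into animator), a minimal sketch might look like the following. It is in Python rather than Perl, and the function names `extract_text` and `strip_wikilinks` are made up for illustration. It assumes the export is well-formed XML and only handles simple links of the form [[target]] and [[target|label]]; templates, file links with several pipes, tables, and references are left untouched, so this is not a replacement for a real parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches [[target]] and [[target|label]] and captures the visible part.
# Links containing nested brackets or more than one pipe are not matched.
WIKILINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_wikilinks(wikitext: str) -> str:
    """Replace simple wiki links with their visible text."""
    return WIKILINK.sub(r"\1", wikitext)


def extract_text(export_xml: str) -> list:
    """Return the contents of every <text> element in an export document."""
    root = ET.fromstring(export_xml)
    texts = []
    for elem in root.iter():
        # Exports may declare a default namespace, in which case the tag
        # looks like '{namespace}text'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            texts.append(elem.text or "")
    return texts


if __name__ == "__main__":
    # Tiny hand-written stand-in for an export file, not real Wikipedia data.
    sample = (
        '<mediawiki><page><revision>'
        '<text xml:space="preserve">An [[animator]] at [[Studio X|a studio]].</text>'
        '</revision></page></mediawiki>'
    )
    for body in extract_text(sample):
        print(strip_wikilinks(body))
```

Anything the regex does not recognise is passed through unchanged, which makes it easy to see what a fuller parser would still need to handle.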

Renegade · « **Reply #1 on:** April 03, 2011, 02:03 AM »

Off-hand I don't know of any, but if you get desperate, you can get the source and look through it. It would be painful, but it would also be more reliable and easier than doing it yourself.

Renegade · « **Reply #2 on:** April 03, 2011, 02:04 AM »

Here's a list:

http://www.mediawiki.../Alternative_parsers

That might have what you're looking for.

normeus · « **Reply #3 on:** April 03, 2011, 10:11 AM »

I found this program:
WP2TXT
It does most of what I need, but for some reason it does not work on some wiki dumps.
Anyway, for now it is the best I have found.

Thanks for the reply Renegade.

normeus · « **Reply #4 on:** April 03, 2011, 12:00 PM »

Is it just me being tired of looking at text? I typed Renegade and it shows up as Reneqade?

Anyway thanks for the reply Renegade.
