Program to extract / filter out content from XML files?

tranglos:
I'm looking for a utility that would read XML files and filter out or export only the contents of specified tags. I need to take this:

[Screenshot: sample XML file, with the text content replaced by x-es]

and get only the "xxxxxx xxx xxxxxxxx xxx" content as output. (I've had to obscure the actual text with x-es in the screenshot, since when I get material for translation, it's accompanied by the most frightful NDAs you've ever seen outside of the likes of NSA/CIA, and I'm not even kidding.)

It has to work in batch mode or be able to load and process any number of files at a time. (The actual numbers are in the thousands, so a manual open->run->save process will not do.)

Has anyone come across such a thing?

(I've written a Python script that does the job just fine for a number of specific tags, but it would be nice to have a generic GUI solution.)

Thanks,
.marek

brett:
Hi Marek

If you need the output parsed to CSV, then this is perfect. I love this utility.

What you need is a little AHK and XML2CSV.
XML2CSV can be found here: http://www.a7soft.com/xml2csv.html
It's a command-line XML parser.

We could create a 'dropzone' or scan a folder in AHK. The output can then be saved in the same folder or a subfolder, in CSV format, with only the specified tags.
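
For comparison, the folder-scanning idea can also be sketched in plain Python. This is a minimal sketch, not Brett's script: it uses the standard library's xml.etree.ElementTree instead of XML2CSV, and the names extract_rows and folder_to_csv, as well as the three-column CSV layout, are assumptions made up for illustration.

```python
import csv
from pathlib import Path
import xml.etree.ElementTree as ET


def extract_rows(xml_path, wanted_tags):
    """Yield (file name, tag, text) for every element whose tag is wanted.

    Only the text directly inside the element (before any child element)
    is read; mixed content would need extra handling.
    """
    for elem in ET.parse(xml_path).iter():
        if elem.tag in wanted_tags and elem.text and elem.text.strip():
            yield (Path(xml_path).name, elem.tag, elem.text.strip())


def folder_to_csv(folder, wanted_tags, out_csv):
    """Scan a folder for *.xml files and write the wanted tag contents to one CSV."""
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "tag", "text"])
        for xml_path in sorted(Path(folder).glob("*.xml")):
            writer.writerows(extract_rows(xml_path, wanted_tags))
```

A call such as folder_to_csv("incoming", {"source", "target"}, "incoming/out.csv") would process every .xml file in a (hypothetical) "incoming" folder in one go; the folder and tag names here are placeholders.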

Let me know if you want this done in AHK or you want to do this in Python.
Should be fewer than 50 lines in AHK.
(Send an altered example by email if you want.)

Brett

f0dder:
I guess your Python solution is using regular expressions, and you're running into / afraid of running into trouble?

A solution could be XSLT/XPATH - you can do some very powerful stuff that way, and pretty easily. Heck, once you get into it, XPATH is simpler than regex, but probably even a bit more powerful in the XML processing domain.
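
To give a feel for how short an XPath query can be, here is a minimal sketch using Python's standard xml.etree.ElementTree, which implements only a subset of XPath. The sample document and the tag names (doc, unit, source, note) are invented for illustration and are not taken from the files discussed in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files will use different tag names.
SAMPLE = """<doc>
  <unit id="1"><source>First string</source><note>internal</note></unit>
  <group><unit id="2"><source>Second string</source></unit></group>
</doc>"""

root = ET.fromstring(SAMPLE)

# ".//source" matches <source> at any depth, so the full path to the
# element does not have to be known or spelled out.
texts = [el.text for el in root.findall(".//source")]
print(texts)

# A predicate narrows the match, here to the unit whose id is "2".
second = root.find(".//unit[@id='2']/source")
print(second.text)
```

The ".//" prefix is what makes the query independent of where the element sits in the tree, which is the part that is awkward to get right with regular expressions.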

tranglos:

If you need the output parsed to CSV, then this is perfect. I love this utility.

What you need is a little AHK and XML2CSV.
XML2CSV can be found here: http://www.a7soft.com/xml2csv.html
It's a command-line XML parser.
-brett (January 25, 2008, 05:19 AM)
--- End quote ---

Thanks, Brett! Let me try this and see if I like it better than writing my own thing from scratch :)

marek

tranglos:
I guess your Python solution is using regular expressions, and you're running into / afraid of running into trouble?

A solution could be XSLT/XPATH - you can do some very powerful stuff that way, and pretty easily. Heck, once you get into it, XPATH is simpler than regex, but probably even a bit more powerful in the XML processing domain.
-f0dder (January 25, 2008, 06:16 AM)
--- End quote ---


In Python I'm using xml.sax. It works fantastically well, even though it's my first Python script and I'm sure it could be much shorter and more idiomatic. (I was going to post about my first night with Python, but real-life work preempted that... maybe later). I'm more of a GUI person, though, and I guess crafting GUIs in Python is best left to those who like building models of Cutty Sark in a bottle :) So the actual tags are hardcoded in the script. Maybe I'll just add a config file.
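
For readers who have not used xml.sax, the approach described here might look roughly like the sketch below. It is not the actual script: the class name TagTextHandler, the helpers load_tags and extract, and the one-tag-per-line config format are all assumptions made for illustration.

```python
import xml.sax


class TagTextHandler(xml.sax.ContentHandler):
    """Collect the text found inside any element named in wanted_tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.results = []   # list of (tag, text) pairs
        self._depth = 0     # > 0 while inside a wanted element
        self._tag = None
        self._chunks = []

    def startElement(self, name, attrs):
        if self._depth:
            self._depth += 1            # child element inside a wanted one
        elif name in self.wanted_tags:
            self._depth = 1
            self._tag = name
            self._chunks = []

    def characters(self, content):
        if self._depth:
            self._chunks.append(content)  # SAX may deliver text in pieces

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append((self._tag, "".join(self._chunks).strip()))


def load_tags(config_path):
    """Read one tag name per line; blank lines and '#' comments are skipped."""
    with open(config_path, encoding="utf-8") as fh:
        lines = (line.strip() for line in fh)
        return {line for line in lines if line and not line.startswith("#")}


def extract(xml_path, wanted_tags):
    """Parse one file and return the collected (tag, text) pairs."""
    handler = TagTextHandler(wanted_tags)
    xml.sax.parse(str(xml_path), handler)
    return handler.results
```

Moving the tag list into a small text file, as load_tags does, keeps the script itself unchanged when the set of tags changes from one job to the next.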

I could do the same thing in Delphi, if I can find a decent, free SAX parser, but it seemed like a good opportunity to try Python just for kicks. Other than that, maybe I don't need to reinvent the wheel if it already exists.

And yes, XSLT would probably fit the bill, since it can act as a filter, but XSLT is black magic to me, and doesn't it need a browser? (Maybe not, shows what I know). Not sure about XPath, since I don't necessarily know the full path to my content, and again it would require hard-coding the sequence of tags, wouldn't it? Using SAX lets me just grab data from between the tags I need, and for anything more complex I could build a stack to keep track of where I am.
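
The stack idea could be sketched as follows: a minimal, hypothetical SAX handler that keeps the list of currently open elements and captures text only when the innermost ones match a given sequence. The class name PathHandler and the example tag names are assumptions for illustration only.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Collect text from elements whose innermost ancestry equals wanted_suffix."""

    def __init__(self, wanted_suffix):
        super().__init__()
        self.wanted_suffix = list(wanted_suffix)  # e.g. ["unit", "source"]; must be non-empty
        self.stack = []                           # open elements, outermost first
        self.results = []
        self._capture_depth = None                # stack depth of the element being captured
        self._chunks = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        tail = self.stack[-len(self.wanted_suffix):]
        if self._capture_depth is None and tail == self.wanted_suffix:
            self._capture_depth = len(self.stack)
            self._chunks = []

    def characters(self, content):
        if self._capture_depth is not None:
            self._chunks.append(content)

    def endElement(self, name):
        # The captured element closes when the stack is back at its depth.
        if self._capture_depth == len(self.stack):
            self.results.append("".join(self._chunks).strip())
            self._capture_depth = None
        self.stack.pop()
```

With wanted_suffix set to ["unit", "source"], a <source> inside a <unit> is captured wherever the <unit> sits in the document, while a <source> under some other parent is ignored, so only the last step or two of the path has to be known.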

marek

