Author Topic: Program to extract / filter out content from XML files?  (Read 18059 times)

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Program to extract / filter out content from XML files?
« on: January 24, 2008, 01:22 PM »
I'm looking for a utility that would read XML files and filter out or export only the contents of specified tags. I need to take this:

[Screenshot: primevaljungle.png]

and get only the "xxxxxx xxx xxxxxxxx xxx" content as output. (I've had to obscure the actual text with x-es in the screenshot, since when I get material for translation, it's accompanied by the most frightful NDAs you've ever seen outside of the likes of NSA/CIA, and I'm not even kidding.)

It has to work in batch mode or be able to load and process any number of files at a time. (The actual numbers are in the thousands, so a manual open->run->save process will not do.)

Has anyone come across such a thing?

(I've written a Python script that does the job just fine for a number of specific tags, but it would be nice to have a generic GUI solution.)

Thanks,
.marek

brett

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 125
  • Australia
Re: Program to extract / filter out content from XML files?
« Reply #1 on: January 25, 2008, 05:19 AM »
Hi Marek

If you need the output parsed to CSV then this is perfect. I love this utility

What you need is a little AHK and XML2CSV.
XML2CSV can be found here: http://www.a7soft.com/xml2csv.html
It's a command-line XML parser.

We could create a 'dropzone' or scan a folder in AHK. The output can then be saved in the same folder or a subfolder in CSV format, containing only the specified tags.

Let me know if you want this done in AHK or you want to do this in Python.
Should be fewer than 50 lines in AHK.
(Send an altered example by email if you want.)

Brett
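
For comparison, the folder-scanning idea can also be sketched in Python with only the standard library, instead of AHK plus XML2CSV. This is a rough illustration, not anything posted in the thread: the function name export_tags_to_csv, the three-column layout, and the assumption that the wanted tags are known up front are all made up for the example.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def export_tags_to_csv(folder, tags, csv_path):
    """Write one CSV row per matching element: file name, tag, text.

    Hypothetical helper; the column layout is an assumption.
    """
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "tag", "text"])
        # rglob walks nested subfolders as well as the top-level folder.
        for xml_file in sorted(Path(folder).rglob("*.xml")):
            root = ET.parse(xml_file).getroot()
            for elem in root.iter():
                if elem.tag in tags:
                    # itertext() also picks up text inside child elements.
                    text = "".join(elem.itertext()).strip()
                    writer.writerow([xml_file.name, elem.tag, text])
                    rows += 1
    return rows
```

Calling export_tags_to_csv("some_folder", {"target"}, "out.csv") would write every target element found under some_folder to out.csv; both names are placeholders.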
 




f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
Re: Program to extract / filter out content from XML files?
« Reply #2 on: January 25, 2008, 06:16 AM »
I guess your Python solution is using regular expressions, and you're running into / afraid of running into trouble?

A solution could be XSLT/XPATH - you can do some very powerful stuff that way, and pretty easily. Heck, once you get into it, XPATH is simpler than regex, but probably even a bit more powerful in the XML processing domain.
- carpe noctem

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Re: Program to extract / filter out content from XML files?
« Reply #3 on: January 25, 2008, 06:41 AM »

If you need the output parsed to CSV then this is perfect. I love this utility

What you need is a little AHK and XML2CSV.
XML2CSV can be found here http://www.a7soft.com/xml2csv.html
Its a command line xml parser.
-brett

Thanks, Brett! Let me try this and see if I like it better than writing my own thing from scratch :)

marek

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Re: Program to extract / filter out content from XML files?
« Reply #4 on: January 25, 2008, 07:24 AM »
I guess your Python solution is using regular expressions, and you're running into / afraid of running into trouble?

A solution could be XSLT/XPATH - you can do some very powerful stuff that way, and pretty easily. Heck, once you get into it, XPATH is simpler than regex, but probably even a bit more powerful in the XML processing domain.
-f0dder


In Python I'm using xml.sax. It works fantastically well, even though it's my first Python script and I'm sure it could be much shorter and more idiomatic. (I was going to post about my first night with Python, but real-life work preempted that... maybe later). I'm more of a GUI person, though, and I guess crafting GUIs in Python is best left to those who like building models of Cutty Sark in a bottle :) So the actual tags are hardcoded in the script. Maybe I'll just add a config file.

I could do the same thing in Delphi, if I can find a decent, free SAX parser, but it seemed like a good opportunity to try Python just for kicks. Other than that, maybe I don't need to reinvent the wheel if it already exists.

And yes, XSLT would probably fit the bill, since it can act as a filter, but XSLT is black magic to me, and doesn't it need a browser? (Maybe not, shows what I know). Not sure about XPath, since I don't necessarily know the full path to my content, and again it would require hard-coding the sequence of tags, wouldn't it? Using SAX lets me just grab data from between the tags I need, and for anything more complex I could build a stack to keep track of where I am.

marek
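
The SAX approach described above can be sketched roughly like this. It is a minimal illustration built on Python's standard xml.sax module, not the actual script from this post; the names TagTextHandler and extract_tag_text and the tag name "target" are invented for the example. The stack records where in the document each piece of text was found.

```python
import xml.sax


class TagTextHandler(xml.sax.ContentHandler):
    """Collect the text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.stack = []    # names of the currently open elements
        self.depth = 0     # how many wanted elements are currently open
        self.buffer = []   # text chunks of the wanted element being read
        self.results = []  # (path, text) pairs in document order

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name in self.wanted:
            self.depth += 1

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self.depth:
            self.buffer.append(content)

    def endElement(self, name):
        if name in self.wanted:
            self.depth -= 1
            if self.depth == 0:
                path = "/".join(self.stack)
                text = "".join(self.buffer).strip()
                self.results.append((path, text))
                self.buffer = []
        self.stack.pop()


def extract_tag_text(xml_text, wanted_tags):
    """Parse an XML string and return (path, text) for each wanted element."""
    handler = TagTextHandler(wanted_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results
```

Passing the wanted tags as an argument, rather than hard-coding them, is the same idea as the config file mentioned above.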


« Last Edit: January 25, 2008, 07:50 AM by tranglos »

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
Re: Program to extract / filter out content from XML files?
« Reply #5 on: January 25, 2008, 08:18 AM »
I dunno if there's XSLT libraries/bindings for Python (there probably are), but I know there's several C/C++ libraries, so you don't need a browser. (Afaik XSLT depends a lot on XPATH to do its transformations?)

Thing with XPATH is that you can use those * operators, so you can locate the sub-tag you want. But I'm not that familiar with it yet :)
- carpe noctem
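
On the worry about needing the full path: a descendant search such as .//target matches a tag at any depth, and * stands in for any element name. A minimal sketch, assuming the standard library's xml.etree.ElementTree (which supports only a limited XPath subset); the sample document, the tag names, and the find_text helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: one <target> nested under <section>, one directly under <unit>.
SAMPLE = """\
<doc>
  <section>
    <unit><source>Hello</source><target>Witaj</target></unit>
  </section>
  <unit><source>World</source><target>Swiat</target></unit>
</doc>
"""


def find_text(xml_text, path):
    """Return the text of every element matching an ElementTree path expression."""
    root = ET.fromstring(xml_text)
    return [elem.text or "" for elem in root.findall(path)]


# ".//target" finds <target> anywhere below the root, no full path needed.
everywhere = find_text(SAMPLE, ".//target")

# "*/target" uses the wildcard: <target> directly under any child of the root.
one_level = find_text(SAMPLE, "*/target")
```

Full XPath 1.0, as offered by dedicated XPath libraries, goes further than this subset, but the wildcard and descendant steps work the same way.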

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #6 on: January 26, 2008, 07:54 AM »
my first night with Python
-tranglos
Dreamy  :-*

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #7 on: January 26, 2008, 08:02 AM »
I dunno if there's XSLT libraries/bindings for Python (there probably are)
-f0dder

lxml has both XPath and XSLT support.

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #8 on: January 26, 2008, 08:09 AM »
I could do the same thing in Delphi,
-tranglos

Adolescent Humor
:o Choosing between Python and Delphi is like trying to decide which one of the Olsen Twins you want to sleep with.


Lashiec

  • Member
  • Joined in 2006
  • **
  • Posts: 2,374
Re: Program to extract / filter out content from XML files?
« Reply #9 on: January 26, 2008, 09:31 AM »
So that means we should avoid both languages?

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Re: Program to extract / filter out content from XML files?
« Reply #10 on: January 26, 2008, 10:16 AM »
Choosing between Python and Delphi is like...
-tinjaw

And the best thing is, you can do both at the same time:
http://mmm-experts.c...cts.aspx?ProductId=3

For best results it helps to have Delphi 7 or later, since in D7 Borland added the $METHODINFO directive, on which Python4Delphi relies to translate public methods of Delphi classes into methods you can freely call from Python. All it takes is 2-3 lines of glue code per class, and the result is truly awesome.

(The same can be achieved in Delphi 6, but requires some deep magic to recreate the functionality $METHODINFO provides in D7).

More adolescent humor
I said "do"!


tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Re: Program to extract / filter out content from XML files?
« Reply #11 on: January 26, 2008, 10:20 AM »
my first night with Python
Dreamy  :-*
-tinjaw

Thank you, tinjaw! Hook, line, sinker :)

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #12 on: January 26, 2008, 02:30 PM »
So that means we should avoid both languages?
-Lashiec

yet more adolescent humor
Lashiec is gaaaayyyyyy. Lashiec is gaaaayyyyy. He doesn't want to sleep with the Olsen Twins. Brokeback Lashiec.


tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #13 on: January 26, 2008, 02:35 PM »
And the best thing is, you can do both at the same time:
http://mmm-experts.c...cts.aspx?ProductId=3
-tranglos
Holy ! I think I just wet my pants.
This paper will demonstrate how to incorporate Python in any of your Delphi apps!  Or conversely, how to use Delphi as a GUI for your python apps.
-Andy Bulka
Time to email my buddies at [struck through: Borland, Inprise, Borland] CodeGear and get a copy of Delphi.

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
Re: Program to extract / filter out content from XML files?
« Reply #14 on: January 26, 2008, 04:24 PM »
Hehe, I love your strike-throughs, tinjaw ^_^

Incorporating Python in other languages, ho humm. I'd personally go for something a bit more light-weight like Lua, but Python is a darn nice language (and set of standard libraries!) for a lot of stuff.
- carpe noctem

Lashiec

  • Member
  • Joined in 2006
  • **
  • Posts: 2,374
Re: Program to extract / filter out content from XML files?
« Reply #15 on: January 26, 2008, 06:49 PM »
[Adolescent humour]

I was expecting that! Look, here's how to make an appropriate comparison (I suppose Python is better than Delphi):

Spoiler
Having to decide between Python and Delphi is like refusing an invitation from Scarlett Johansson: You'll end up in her twin brother's bed.


See? Now it's much clearer. I hope tranglos does not go mad
« Last Edit: January 26, 2008, 07:56 PM by Lashiec »

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
Re: Program to extract / filter out content from XML files?
« Reply #16 on: January 26, 2008, 07:37 PM »
I'm enjoying every byte of it! (*scribbles notes*) Please continue. (*scribbles*)

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #17 on: January 26, 2008, 08:48 PM »
(I've written a Python script that does the job just fine for a number of specific tags, but it would be nice to have a generic GUI solution.)
-tranglos

Why for the desire for GUI?  :huh: What would it do that the command line script doesn't?

tranglos

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,081
    • View Profile
    • Donate to Member
Re: Program to extract / filter out content from XML files?
« Reply #18 on: January 26, 2008, 09:13 PM »
Why for the desire for GUI?  :huh: What would it do that the command line script doesn't?
-tinjaw

Example:

Right now I am reviewing translated files that comprise a large website, 8325 xml files in nested folders. I'm looking only at files that have been recently added or updated, of which there's about a hundred. When I finish reviewing, I'll need to run spellcheck on files I have modified, and only on those (the other files had been spellchecked at earlier reviews). The files are bilingual, so I need to extract only the translated content first. This is the job for the Python script.

But how do I feed relevant files to the script? Wildcards won't do; I would have to invent some command-line syntax to specify dates, parse it, compare against filesystem datestamps, etc.

I could do this, but I'm too lazy, and there's a much easier way: using Total Commander, I can easily filter out the recently modified files, it takes seconds, and then I can drag-drop them onto -- well, onto nothing at the moment. This is where a window that accepts dropped files would be quicker and more flexible than pure command line.

(Instead, I do as above, but copy the modified files to a temp folder and run the script from there.)

OK, so that's a lame and wordy excuse. I just like GUI-ness!
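
For what it's worth, the date filtering described above does not need much invented syntax. A minimal sketch, assuming "recent" means "modified within the last N days"; the function name recently_modified and its parameters are hypothetical, and only the standard library is used.

```python
import os
import time


def recently_modified(root_dir, max_age_days, suffix=".xml"):
    """Yield files under root_dir modified within the last max_age_days days.

    Hypothetical helper: the cutoff rule and the suffix filter are assumptions.
    """
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    # os.walk descends into nested folders, so the whole site tree is covered.
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.lower().endswith(suffix) and os.path.getmtime(path) >= cutoff:
                yield path
```

The resulting paths could be handed straight to the extraction script, which would avoid copying the modified files to a temp folder first.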

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #19 on: January 26, 2008, 09:28 PM »
Create a batch file on your desktop and drag the files onto it.
FOR %%I IN (%*) DO TYPE %%I
PAUSE

That will print out the contents of the files. Drag some text files onto it.

Replace TYPE with the command that runs the Python script.
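
On the Python side, the dropped files simply arrive as command-line arguments. A minimal sketch of a script that accepts them, assuming it is saved under some name such as extract.py (hypothetical) and that the wanted tag is called "target" (also an assumption):

```python
import sys
import xml.etree.ElementTree as ET


def extract_file(path, tag="target"):
    """Return the text of every <tag> element in one XML file."""
    root = ET.parse(path).getroot()
    return ["".join(elem.itertext()).strip() for elem in root.iter(tag)]


def main(paths):
    """Process each path given on the command line; return the number of failures."""
    failures = 0
    for path in paths:
        try:
            for text in extract_file(path):
                print(text)
        except (OSError, ET.ParseError) as exc:
            # Keep going so one bad file does not abort a large batch.
            print(f"{path}: {exc}", file=sys.stderr)
            failures += 1
    return failures


if __name__ == "__main__":
    # sys.argv[1:] holds the file paths passed in by the batch file.
    main(sys.argv[1:])
```

Because main accepts a list of paths, the batch file could also pass all dropped files in one call instead of looping over them one at a time.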

tinjaw

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,927
Re: Program to extract / filter out content from XML files?
« Reply #20 on: January 29, 2008, 08:36 AM »
I dunno if there's XSLT libraries/bindings for Python (there probably are)
-f0dder

There is now a new open-source Python XPath library available, WebPath.