Author Topic: IDEA: extract data field from files and produce alphabetical list (Read 29850 times)

DyNama · « **on:** November 02, 2011, 02:59 PM »

hi, gang, great forum! i produce an alphabetical list manually using several different programs but thought i'd see if i could automate the task.

a folder of my tv listing program has 2 kinds of files in it but i want to extract the names of tv shows from files matching the filename ScheduleData*.xml. here's a small chunk of the contents of 1 of the files:

<ch ChNo="15"><show Aff="" CId="28456692" PId="28410472" Title="Public Access" CLetter="PUAC015" STime="05/09/2011 14:00:00" Dur="240" Rep="N" New="" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="113" /></Categories></show></ch><ch ChNo="16"><show Aff="PBS" CId="28455507" PId="188545708" Title="WordWorld" CLetter="WPTD" STime="05/09/2011 15:30:00" Dur="30" Rep="Y" New="N" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="3" /><Category Id="7" /><Category Id="105" /><Category Id="106" /><Category Id="304" /><Category Id="702" /><Category Id="706" /><Category Id="1911" /></Categories></show></ch><ch ChNo="17"><show Aff="CBS" CId="28457103" PId="258095235" Title="The Price Is Right" CLetter="WHIO" STime="05/09/2011 15:00:00" Dur="60" Rep="N" New="Y" Logo="" Prem="" Fin=""><Categories><Category Id="7" /><Category Id="707" /><Category Id="1911" /></Categories></show></ch><ch ChNo="18"><show Aff="" CId="28455318" PId="28436365" Title="Information Channel" CLetter="INFO018" STime="05/09/2011 14:00:00" Dur="240" Rep="Y" New="N" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="113" /></Categories></show></ch>

the only data i want to extract are between Title= and CLetter markers, which i've bolded above.

what i want is a text file with all of those Titles from about 750 .xml files in alphabetical order (there will be 10s of thousands), stripping the quotes around the names and the Title= tag, eliminating all duplicates (about 35,000), and replacing &amp; with a real ampersand, and open and close quotes with ordinary keyboard quotes " so that from the above sample i'd end up with a file that just has this in it:

Information Channel
Public Access
The Price Is Right
WordWorld

at present i use ReplaceText to isolate the Title= lines with carriage returns, Catview2000 to extract those lines to a new file, and TextPad to sort and eliminate duplicate titles. 2 of those programs are no longer being developed, the 3rd is unregistered shareware. since it is some trouble, i only make a new list occasionally.
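
For readers who want to see the request as code, here is a minimal Python sketch of the extract, unescape, de-duplicate, and sort steps described above. It is not the tool that was later posted in this thread. The names `TITLE_RE`, `PLAIN_QUOTES`, and `extract_titles` are made up for illustration, and it assumes every title sits in a `Title="..."` attribute as in the sample.

```python
import html
import re

# Assumption: titles always appear as a Title="..." attribute, as in the sample.
# \b keeps the pattern from matching inside a longer attribute name.
TITLE_RE = re.compile(r'\bTitle="([^"]*)"')

# Map curly open/close quotes to the plain keyboard quote.
PLAIN_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def extract_titles(xml_text):
    """Return the unique titles in xml_text, unescaped and sorted."""
    titles = set()
    for raw in TITLE_RE.findall(xml_text):
        # html.unescape turns &amp; into & (and other entities into characters).
        titles.add(html.unescape(raw).translate(PLAIN_QUOTES))
    return sorted(titles)


# Abridged from the sample above (most attributes dropped for brevity),
# with one entry repeated to show that duplicates collapse.
sample = (
    '<ch ChNo="15"><show Title="Public Access" CLetter="PUAC015" /></ch>'
    '<ch ChNo="16"><show Title="WordWorld" CLetter="WPTD" /></ch>'
    '<ch ChNo="17"><show Title="The Price Is Right" CLetter="WHIO" /></ch>'
    '<ch ChNo="18"><show Title="Information Channel" CLetter="INFO018" /></ch>'
    '<ch ChNo="16"><show Title="WordWorld" CLetter="WPTD" /></ch>'
)

print("\n".join(extract_titles(sample)))
```

A regular expression is used here instead of a full XML parser only because the request is for a single attribute; a real parser would be the safer choice if the file layout ever changes.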

skwire · « **Reply #1 on:** November 02, 2011, 03:03 PM »

Can you provide us with a few of those XML files?

DyNama · « **Reply #2 on:** November 02, 2011, 04:19 PM »

sure, here's 5 of them.

skwire · « **Reply #3 on:** November 02, 2011, 05:01 PM »

So, to recap, what you want is:

All titles extracted into a list.
UnHTML them.
Sort them.
Remove duplicates.
Spit this data out into a text file.

Questions:

I will assume that you would like to specify a root folder, have the program recurse through the subfolders, and build a list of all the titles it can find in all these Schedule*.XML files you have, right?
The original Schedule*.XML files are in UTF-16 format. Would you like to keep that or convert them to something else? UTF-8, etc.?
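
The two questions above (recursing from a root folder, and the UTF-16 source encoding) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of the tool posted later: `read_schedule_files` is a hypothetical name, and the files are assumed to start with a byte-order mark so that the `utf-16` codec can pick the right byte order.

```python
from pathlib import Path


def read_schedule_files(root):
    """Yield (path, text) for every ScheduleData*.xml found under root."""
    # rglob walks root and all of its subfolders. On Windows the pattern
    # match ignores case, so files ending in .XML are found as well.
    for path in sorted(Path(root).rglob("ScheduleData*.xml")):
        # Assumption: each file starts with a BOM, which the "utf-16"
        # codec uses to detect the byte order while decoding.
        yield path, path.read_text(encoding="utf-16")
```

Once the text has been decoded, the source encoding no longer matters: the output list can be written in whatever encoding suits the program that reads it.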

DyNama · « **Reply #4 on:** November 02, 2011, 06:29 PM »

So, to recap, what you want is:

All titles extracted into a list.
UnHTML them.
Sort them.
Remove duplicates.
Spit this data out into a text file.

Questions:

I will assume that you would like to specify a root folder, have the program recurse through the subfolders, and build a list of all the titles it can find in all these Schedule*.XML files you have, right?
The original Schedule*.XML files are in UTF-16 format. Would you like to keep that or convert them to something else? UTF-8, etc.?
-skwire (November 02, 2011, 05:01 PM)

the source folder is always the same, the path gets changed only rarely, and all the source files are in that one folder, but there's also 750 Programme*.dat files in there too. the source files can stay unchanged, they get changed daily by the tv listing program, Digiguide (it's a UK programme). the output file, just a plain text file, would always be the same too, programmes.uq, in a different folder, overwriting the existing file. it's in Program Files (x86) and i sometimes have problems writing files to folders there--durn win7 permissions!--tho at the mo it lets me write or copy to this file.
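
The setup described above (one fixed source folder that also holds Programme*.dat files, and one output file that is overwritten each time) might look like this in Python. Both function names are invented for the example, and UTF-8 output is an assumption, since the thread does not say which encoding programmes.uq needs.

```python
from pathlib import Path


def list_schedule_files(folder):
    """Return the ScheduleData*.xml files directly inside folder (no recursion)."""
    # The pattern skips the Programme*.dat files that share the folder.
    # On Windows the match ignores case, so .XML files are found too.
    return sorted(Path(folder).glob("ScheduleData*.xml"))


def write_title_list(titles, out_path):
    """Overwrite out_path with one title per line."""
    # Assumption: UTF-8 output; the required encoding is not stated in the thread.
    Path(out_path).write_text("".join(t + "\n" for t in titles), encoding="utf-8")
```

If the output folder is under Program Files, the write may fail with a `PermissionError` unless the script runs with sufficient rights, which matches the permission trouble mentioned above.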

thanx for your kind and prompt attention, Skwire! i used to code on the spot with AutoLISP when i was an AutoCAD draftsman (laid off), but i never learned any windows programming languages.

DyNama · « **Reply #5 on:** November 02, 2011, 07:04 PM »

i ought to add, per #3, that the sort is to be not case sensitive.
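
A case-insensitive sort can be sketched in Python like this. The function name `sort_titles` is hypothetical. Only the ordering ignores case here; two titles that differ only in capitalisation are both kept, since the request was about sorting rather than merging.

```python
def sort_titles(titles):
    """Return the unique titles, ordered without regard to case."""
    # casefold() gives the case-insensitive sort key; the title itself is
    # the tie-breaker so the result is the same on every run.
    return sorted(set(titles), key=lambda t: (t.casefold(), t))


print(sort_titles(["b", "A", "a", "B", "a"]))
# prints ['A', 'a', 'B', 'b']
```

A plain `sorted()` on the same input would give `['A', 'B', 'a', 'b']`, with every capitalised title ahead of every lowercase one.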

skwire · « **Reply #6 on:** November 02, 2011, 09:25 PM »

Please try this: DyNamaParser

Run it, select your folder and let it work. You can view the progress by hovering over the tray icon. When it's done, it will pop up the standard save file dialog. Please let me know how it works out for you. Thanks.

DyNama · « **Reply #7 on:** November 02, 2011, 10:15 PM »

worked great! that is absolutely marvelous! extraordinary! i've wanted a program like this for years! thank you very much!

was that much trouble? i'm going to have to look into AHK, it seems to be very powerful.

this UK program provides this list for its UK customers but not its US customers. i actually complained about the missing list on their forum! i will post it in my old thread but for all i know, i'm the only one who missed it. thanks again!

DyNama · « **Reply #8 on:** November 02, 2011, 10:28 PM »

oops, i did find a little error in it. in the ScheduleData-45133-OH34525R(All)-1321308000.XML file, the 50kb file in the sample, there is a title that contains a comma:

Title="1,000 Ways to Die" CLetter="SPIKETV"

and the parser program cuts off the comma and everything in front of it: 000 Ways to Die
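
The thread does not say what caused this in the parser. One common way to get exactly this symptom is to use a comma as the separator somewhere in the pipeline, so the Python sketch below only illustrates that possibility and a way around it. `TITLE_RE` and `title_of` are names invented for the example.

```python
import re

# Capture everything between the quotes, so commas inside a title survive.
TITLE_RE = re.compile(r'\bTitle="([^"]*)"')


def title_of(fragment):
    """Return the Title attribute value in fragment, or None if absent."""
    match = TITLE_RE.search(fragment)
    return match.group(1) if match else None


print(title_of('Title="1,000 Ways to Die" CLetter="SPIKETV"'))
# prints 1,000 Ways to Die

titles = ["1,000 Ways to Die", "WordWorld"]

# A comma-separated list breaks the title apart when it is split again...
broken = ",".join(titles).split(",")

# ...while a separator that cannot occur inside a title keeps it whole.
safe = "\n".join(titles).split("\n")
```

Here `broken` ends up as `['1', '000 Ways to Die', 'WordWorld']`, which reproduces the cut-off title, while `safe` matches the original list.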

skwire · « **Reply #9 on:** November 02, 2011, 11:51 PM »

worked great! that is absolutely marvelous! extraordinary! i've wanted a program like this for years! thank you very much!
-DyNama (November 02, 2011, 10:15 PM)

You're very welcome. I'm glad it worked for you.

was that much trouble? i'm going to have to look into AHK, it seems to be very powerful.
-DyNama (November 02, 2011, 10:15 PM)

No trouble at all. I like parsing snacks like this one.

oops, i did find a little error in it. in the ScheduleData-45133-OH34525R(All)-1321308000.XML file, the 50kb file in the sample, there is a title that contains a comma:

Title="1,000 Ways to Die" CLetter="SPIKETV"

and the parser program cuts off the comma and everything in front of it: 000 Ways to Die
-DyNama (November 02, 2011, 10:28 PM)

Doh.

Totally my fault. Please re-download and see if this build fixes that issue. DyNamaParser

DyNama · « **Reply #10 on:** November 03, 2011, 12:41 AM »

that'll do it! fantastic! thanx again!

mouser · « **Reply #11 on:** November 03, 2011, 05:48 AM »

Just wanted to stick my nose in the door and say how much I love seeing these kinds of threads

DyNama · « **Reply #12 on:** November 03, 2011, 10:35 AM »

Just wanted to stick my nose in the door and say how much I love seeing these kinds of threads
-mouser (November 03, 2011, 05:48 AM)

thanx for the cool service, Mouser!

i've run our program, which i renamed DGNameParser.exe, several times without any problems

but when a fellow US user tried it, he posted this to that software's forum:

When running DyNamaParser I get the following error msg, starting when processing 495 of 576:

Error: Memory limit reached (see #MaxMem in the help file).
The current thread will exit.
Line#
042:myBlock .= myTitles . "

skwire, what could cause that?

skwire · « **Reply #13 on:** November 03, 2011, 10:45 AM »

http://www.autohotke...commands/_MaxMem.htm

AutoHotkey, by default, restricts the amount of memory per variable to 64 megs (but is easily changeable up to four gigs per variable). This begs the question, though... How large are the files he is running this on?

If you want, have him zip them up and send them to me for testing.

Here's a version that increases that to one gig per var: DyNamaParser
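
Raising the limit is one fix. Another general approach, sketched below in Python, is to avoid building one ever-growing block of text and to keep only the unique titles as each file is read, so memory use follows the number of distinct titles instead of the total amount of data. This is an illustration only and says nothing about how DyNamaParser works internally; `collect_unique_titles` and `TITLE_RE` are hypothetical names, and UTF-16 input is assumed from the earlier post.

```python
import html
import re
from pathlib import Path

# Assumption: titles appear as Title="..." attributes, as in the posted sample.
TITLE_RE = re.compile(r'\bTitle="([^"]*)"')


def collect_unique_titles(paths, encoding="utf-16"):
    """Return the set of unescaped titles found across all the given files."""
    unique = set()
    for path in paths:
        # Only one file's text is held in memory at a time.
        text = Path(path).read_text(encoding=encoding)
        # Repeated titles collapse here instead of piling up in one big string.
        unique.update(html.unescape(t) for t in TITLE_RE.findall(text))
    return unique
```

The set can then be sorted and written out once, after the last file has been processed.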

DyNama · « **Reply #14 on:** November 03, 2011, 12:50 PM »

... How large are the files he is running this on? If you want, have him zip them up and send them to me for testing.
-skwire (November 03, 2011, 10:45 AM)

i don't really know. i would guess the number of ScheduleData files is governed by how many weeks of past tv listings we keep and the size is governed by the number of channels in the listings. the total size of all my ScheduleData*.xml files is 1.7gb.

skwire · « **Reply #15 on:** November 03, 2011, 01:49 PM »

i don't really know. i would guess the number of ScheduleData files is governed by how many weeks of past tv listings we keep and the size is governed by the number of channels in the listings. the total size of all my ScheduleData*.xml files is 1.7gb.
-DyNama (November 03, 2011, 12:50 PM)

Understood. Well, if you want to, you could 7-zip your collection and FTP it to me so I can test (and possibly improve your parser) against it.

DyNama · « **Reply #16 on:** November 03, 2011, 02:30 PM »

i can do that! but i made a mistake, all of them total only 76megs

i must have included all the other files in this folder.

7zip Ultra compresses them down to only 2.3megs! gotta love file compression!

skwire · « **Reply #17 on:** November 03, 2011, 04:01 PM »

Thanks for those files. I optimised the parser by roughly a factor of six on my machine. The original build took 124 seconds to complete through your file set. This new build takes 20 seconds. Give it a try and let me know how it works out for you.

DyNamaParser

DyNama · « **Reply #18 on:** November 03, 2011, 05:20 PM »

oh, yeah, much faster! the previous version took long enough to get distracted!

thanx!
