
DonationCoder.com Software > Post New Requests Here

IDEA: extract data field from files and produce alphabetical list


DyNama:
hi, gang, great forum! i produce an alphabetical list manually using several different programs but thought i'd see if i could automate the task.

a folder of my tv listing program has 2 kinds of files in it but i want to extract the names of tv shows from files matching the filename ScheduleData*.xml. here's a small chunk of the contents of 1 of the files:

<ch ChNo="15"><show Aff="" CId="28456692" PId="28410472" Title="Public Access" CLetter="PUAC015" STime="05/09/2011 14:00:00" Dur="240" Rep="N" New="" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="113" /></Categories></show></ch><ch ChNo="16"><show Aff="PBS" CId="28455507" PId="188545708" Title="WordWorld" CLetter="WPTD" STime="05/09/2011 15:30:00" Dur="30" Rep="Y" New="N" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="3" /><Category Id="7" /><Category Id="105" /><Category Id="106" /><Category Id="304" /><Category Id="702" /><Category Id="706" /><Category Id="1911" /></Categories></show></ch><ch ChNo="17"><show Aff="CBS" CId="28457103" PId="258095235" Title="The Price Is Right" CLetter="WHIO" STime="05/09/2011 15:00:00" Dur="60" Rep="N" New="Y" Logo="" Prem="" Fin=""><Categories><Category Id="7" /><Category Id="707" /><Category Id="1911" /></Categories></show></ch><ch ChNo="18"><show Aff="" CId="28455318" PId="28436365" Title="Information Channel" CLetter="INFO018" STime="05/09/2011 14:00:00" Dur="240" Rep="Y" New="N" Logo="" Prem="" Fin=""><Categories><Category Id="1" /><Category Id="113" /></Categories></show></ch>

--- End quote ---
the only data i want to extract are the values between the Title= and CLetter markers in the sample above (for example, Title="Public Access").

what i want is a text file with all of those Titles from about 750 .xml files in alphabetical order (there will be 10s of thousands), stripping the quotes around the names and the Title= tag, eliminating all duplicates (about 35,000), and replacing &amp; with a real ampersand, and open and close quotes with ordinary keyboard quotes " so that from the above sample i'd end up with a file that just has this in it:

Information Channel
Public Access
The Price Is Right
WordWorld

--- End quote ---

at present i use ReplaceText to isolate the Title= lines with carriage returns, Catview2000 to extract those lines to a new file, and TextPad to sort and eliminate duplicate titles. 2 of those programs are no longer being developed, the 3rd is unregistered shareware. since it is some trouble, i only make a new list occasionally.
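
The manual steps above (isolate the Title values, clean them up, sort, drop duplicates) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the name extract_titles and the regular expression are hypothetical, titles are assumed never to contain a raw double quote, and only curly double quotes are converted to keyboard quotes.

```python
import html
import re

# Assumption: every title sits in a Title="..." attribute with no raw " inside it.
TITLE_RE = re.compile(r'\bTitle="([^"]*)"')

# Curly double quotes mapped to the ordinary keyboard quote.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})


def extract_titles(text):
    """Return the unique Title values found in text, sorted alphabetically."""
    titles = set()  # a set drops duplicates as they are added
    for raw in TITLE_RE.findall(text):
        # html.unescape turns &amp; into & (and entities like &ldquo; into curly quotes)
        title = html.unescape(raw).translate(QUOTE_MAP).strip()
        if title:
            titles.add(title)
    # Case-insensitive sort; the second key keeps the order deterministic.
    return sorted(titles, key=lambda t: (t.casefold(), t))
```

Calling extract_titles on the text of one schedule file returns a plain list of strings, so the results from many files can be merged into one set before the final sort.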

skwire:
Can you provide us with a few of those XML files?

DyNama:
sure, here's 5 of them.

skwire:
So, to recap, what you want is:


* All titles extracted into a list.
* UnHTML them.
* Sort them.
* Remove duplicates.
* Spit this data out into a text file.
Questions:


* I will assume that you would like to specify a root folder, have the program recurse through the subfolders, and build a list of all the titles it can find in all these Schedule*.XML files you have, right?
* The original Schedule*.XML files are in UTF-16 format.  Would you like to keep that or convert them to something else?  UTF-8, etc.?
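
To make the two questions above concrete, here is a small Python sketch of finding the schedule files (with or without recursion) and reading them as UTF-16. The function names iter_schedule_files and read_schedule are hypothetical, and the ScheduleData*.xml pattern is taken from the original request. Note that pattern matching is case-sensitive on some systems.

```python
from pathlib import Path


def iter_schedule_files(root, recurse=True):
    """Yield ScheduleData*.xml paths under root, optionally including subfolders."""
    root = Path(root)
    # rglob descends into subfolders; glob looks only in root itself.
    matcher = root.rglob if recurse else root.glob
    yield from sorted(matcher("ScheduleData*.xml"))


def read_schedule(path):
    """Read one schedule file as text, leaving the file itself unchanged."""
    # The "utf-16" codec uses the byte-order mark at the start of the file
    # to pick the byte order, and strips it from the returned text.
    return Path(path).read_text(encoding="utf-16")
```

Reading the files as text this way leaves the originals untouched, so the encoding question only matters for the output list.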

DyNama:
So, to recap, what you want is:


* All titles extracted into a list.
* UnHTML them.
* Sort them.
* Remove duplicates.
* Spit this data out into a text file.
Questions:


* I will assume that you would like to specify a root folder, have the program recurse through the subfolders, and build a list of all the titles it can find in all these Schedule*.XML files you have, right?
* The original Schedule*.XML files are in UTF-16 format.  Would you like to keep that or convert them to something else?  UTF-8, etc.?
-skwire (November 02, 2011, 05:01 PM)
--- End quote ---

the source folder is always the same, the path gets changed only rarely, and all the source files are in that one folder, but there's also 750 Programme*.dat files in there too. the source files can stay unchanged, they get changed daily by the tv listing program, Digiguide (it's a UK programme). the output file, just a plain text file, would always be the same too, programmes.uq, in a different folder, overwriting the existing file. it's in Program Files (x86) and i sometimes have problems writing files to folders there--durn win7 permissions!--tho at the mo it lets me write or copy to this file.
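
The output step described above (always overwriting the same plain text file, which may sit in a protected folder) might look something like this in Python. It is a sketch, not a tested tool: write_title_list is a hypothetical name, and UTF-8 is an assumed output encoding, since the thread does not say what encoding the listing program expects for programmes.uq.

```python
from pathlib import Path


def write_title_list(titles, out_path):
    """Overwrite out_path with one title per line; return True on success."""
    out_path = Path(out_path)
    try:
        # write_text replaces any existing file contents.
        # Assumption: UTF-8 is acceptable for the output file.
        out_path.write_text("\n".join(titles) + "\n", encoding="utf-8")
    except PermissionError:
        # Folders under Program Files may refuse writes without elevation.
        return False
    return True
```

Returning False instead of crashing lets the caller fall back to another location, or just report that the file could not be written.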

thanx for your kind and prompt attention, Skwire! i used to code on the spot with AutoLISP when i was an AutoCAD draftsman (laid off), but i never learned any windows programming languages.
