kalos
« on: April 23, 2012, 11:14:43 AM »
Hello!
I have a word list (different each time), where each word (or line, if you prefer) is the name of a chemical substance. I also have a huge folder of PDF files containing the specifications of every substance.
What I do is gather the PDFs of the specific substances in the list, to make a set of PDF files corresponding to every substance in that word list.
But I need to automate that process, so can you please suggest:
1) how I can automatically link the word list to the corresponding PDF files
2) how I can create a single file (not sure what kind of file) that will contain all these related files (compressed archives such as rar or zip would not be convenient, because they take time to open)
3) how I can check at any point whether a specific project (e.g. the gathering of the PDFs for a word list) is incomplete, and exactly which files are missing (so that I can add them manually)
Can you suggest an idea, a procedure, an Excel file to build, an AHK script to build, a program, anything!
Thanks!

kalos
« Reply #1 on: April 24, 2012, 01:40:06 PM »
any suggestion???

Ath
« Reply #2 on: April 24, 2012, 02:24:47 PM »
This probably needs more explanation, because I can't make sense of it, even after reading it 3 times in a row.

rjbull
« Reply #3 on: April 24, 2012, 03:04:16 PM »
Ath's right, needs clarification. Maybe an example? Are your PDFs named according to the chemical itself, i.e. the same name as in the word list? You can (usually) concatenate PDFs from the command line using pdftk. So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them.
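
A minimal sketch of that idea, written in Python rather than as a batch file: it reads the word list and builds the pdftk command line without running it. The function name `build_pdftk_command`, the `specs` folder, and the output filename are assumptions made for illustration; only the `pdftk <inputs> cat output <file>` form comes from pdftk itself.

```python
from pathlib import Path


def build_pdftk_command(wordlist_text, pdf_dir, output_pdf):
    """Return the pdftk argument list that concatenates one PDF per word-list line.

    Assumes each PDF is named exactly "<line>.pdf" inside pdf_dir (hypothetical layout).
    """
    # One chemical per line; blank lines are ignored.
    names = [line.strip() for line in wordlist_text.splitlines() if line.strip()]
    inputs = [str(Path(pdf_dir) / f"{name}.pdf") for name in names]
    # pdftk syntax: pdftk in1.pdf in2.pdf ... cat output combined.pdf
    return ["pdftk", *inputs, "cat", "output", str(output_pdf)]


if __name__ == "__main__":
    cmd = build_pdftk_command(
        "Acetone\nHydrochloric Muriatic Acid\n", "specs", "combined.pdf"
    )
    # With pdftk installed, the list could be passed to subprocess.run(cmd, check=True).
    print(cmd)
```

Keeping the command as a list (instead of one string) means names with spaces, such as "Hydrochloric Muriatic Acid", need no extra quoting.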

kalos
« Reply #4 on: April 24, 2012, 04:03:03 PM »
Quote from rjbull:
Are your PDFs named according to the chemical itself, i.e. the same name as in the word list?

Yes.

Quote from rjbull:
You can (usually) concatenate PDFs from the command line using pdftk. So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them.

I do not want to combine the PDFs into one PDF. I want to somehow attach them to the word list :/

Renegade
« Reply #5 on: April 24, 2012, 05:07:26 PM »
The only thing I can think of is building some custom software/script to parse the text and create the database like that.
But without knowing much more, that's about all I can say.
For a single file, maybe store the PDFs in the DB as a blob in a single table?
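
As a rough sketch of the "blob in a single table" idea, here is one way it could look with SQLite, whose whole database lives in one ordinary file. The table name `specs`, the column names, and the function `store_pdfs` are assumptions for illustration, not something proposed in the thread.

```python
import sqlite3
from pathlib import Path


def store_pdfs(db_path, wordlist, pdf_dir):
    """Store each chemical's PDF as a BLOB in one SQLite file; return names with no PDF."""
    missing = []
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS specs (chemical TEXT PRIMARY KEY, pdf BLOB)"
        )
        for name in wordlist:
            pdf = Path(pdf_dir) / f"{name}.pdf"
            # A NULL blob marks an entry whose PDF still has to be added by hand.
            data = pdf.read_bytes() if pdf.is_file() else None
            if data is None:
                missing.append(name)
            con.execute(
                "INSERT OR REPLACE INTO specs (chemical, pdf) VALUES (?, ?)",
                (name, data),
            )
        con.commit()
    finally:
        con.close()
    return missing
```

The returned list doubles as the "which files are missing" report, and a later `SELECT chemical FROM specs WHERE pdf IS NULL` would give the same answer from the file alone.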

kalos
« Reply #6 on: April 24, 2012, 05:49:56 PM »
You mean to use Excel files? That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Renegade
« Reply #7 on: April 24, 2012, 08:26:23 PM »
Quote from kalos:
You mean to use Excel files? That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes. Can you post a PDF?

kalos
« Reply #8 on: April 25, 2012, 02:12:04 AM »
Quote from Renegade:
Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes. Can you post a PDF?

But can't Excel store a PDF, or any other type of file, in a cell? If not, which specific RDBMS would you suggest? (For a veeery simple task like this, one that doesn't require much learning.)
As for the PDFs, an example is this: http://www.purolite.com/R.../Regen%20Chem%20Specs.pdf (renamed as Hydrochloric Muriatic Acid.pdf)

Ath
« Reply #9 on: April 25, 2012, 02:30:59 AM »
Does that mean there is more than one chemical component described in one PDF file, or is this the kind of resulting PDF you want to have as output eventually? And how are the chemical components in the 'word list' separated from each other: by commas, or each on a new line? Or should it just try to find a word as a PDF and, if it's not there, add the next word and retry, etc.?

Renegade
« Reply #10 on: April 25, 2012, 02:41:21 AM »
Quote from kalos:
but, cant EXCEL store a pdf or any other type, file in a cell? if not, which specific RDBM would you suggest? (for a veeery simple task as this, that doesnt require to learn much) as for the pdfs, an example is this: http://www.purolite.com/R.../Regen%20Chem%20Specs.pdf (renamed as Hydrochloric Muriatic Acid.pdf)

Well, yes, Excel *can* have a file embedded, but not in a cell (AFAIK).
For an RDBMS, just about any will do. Take your pick of whatever you like really.
For the PDF... Sorry. You're hosed. Completely hosed.
The data in the PDF is in the first normal form (and *maybe* the second normal form). i.e. It's all mixed up and jumbled so as to be useless as a database.
Now, it *is* possible to write some software to go through all the different cases and to parse it all, but it is VERY far from being a trivial/easy task. Actually, it's pretty easy, but it is extremely time consuming.
The only solution I see that is remotely quick is to cut each line at the = sign then use the left side as a key and the right as a value. Keys can either be unique or not, but I would probably want to normalize things and try to force them to be unique.
Someone else may have a better idea.
As for extracting the data from the PDF, no clue. I loathe working with PDFs because they are a terminal format where once they are in that format, that's the end. Trying to get anything back out of them is a nightmare.
You can try to save a PDF as a DOC to see just how terrifying the monstrosities that are produced are... I couldn't save the file you had as a DOC because it has a <bad font>, whatever that means... (not even as HTML...)
I tried copying & pasting... No luck. Cells are collapsed to spaces, which effectively reduces any hope of the second normal form to the first normal form, i.e. useless.
It's better to go from database to PDF. That makes sense and is manageable.
Going from PDF to anything is nigh hopeless.
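
The "cut each line at the = sign" approach could be sketched roughly as below, assuming the text has already been extracted from the PDF somehow. The function name `parse_key_values`, the key normalisation (lower case, collapsed spaces), and the choice to keep repeated keys as lists are illustrative assumptions; the sample lines in the usage note are made up and not taken from the linked PDF.

```python
def parse_key_values(text):
    """Split each line at the first '=' into key and value; lines without '=' are skipped."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line, nothing to split
        # Normalise spacing and case so "Color" and "color " end up as the same key.
        key = " ".join(key.split()).lower()
        # Keys may repeat, so every value for a key is kept in a list.
        pairs.setdefault(key, []).append(value.strip())
    return pairs
```

For example, `parse_key_values("Color = clear\ncolor=water white")` puts both values under the single key `color`, which is one way of forcing keys to be unique without losing data.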

kalos
« Reply #11 on: April 25, 2012, 12:53:56 PM »
Quote from Ath:
Does that mean there are more then 1 chemical components described in 1 pdf file, or is this the kind of resulting pdf you want to have output eventually? And how are the chemical components in the 'word list' separated from each other, by comma's or each on a new line? Or just try to find a word as a pdf, if not there, add the next word and retry, etc.?

Each PDF describes one chemical.
Each word of a word list is one chemical.
Each chemical corresponds to one PDF.
Each word list is a bunch of different chemicals.
Chemicals in the word list are separated by newlines.
I would say it is easy to write a script that gathers into a folder the PDFs whose filenames are mentioned in a word list, one per line (and of course without the file extension .pdf being written in the list).
But after creating many folders of PDFs, where each folder gathers the PDFs of one word list, I need a way to know whether a folder contains all the necessary PDFs mentioned in the word list, or whether a PDF is missing (because a word in the word list doesn't have a matching PDF).
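
A minimal sketch of such a gathering script, in Python, might look like this. The function name `gather_pdfs` and the `missing.txt` report file are assumptions for illustration; the only rules taken from the post are one chemical per line and the filename `<chemical>.pdf`.

```python
import shutil
from pathlib import Path


def gather_pdfs(wordlist_file, source_dir, project_dir):
    """Copy "<name>.pdf" for every word-list line into project_dir; return (found, missing)."""
    source_dir, project_dir = Path(source_dir), Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    found, missing = [], []
    for line in Path(wordlist_file).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines in the word list
        src = source_dir / f"{name}.pdf"
        if src.is_file():
            shutil.copy2(src, project_dir / src.name)
            found.append(name)
        else:
            missing.append(name)
    # Keep the report next to the PDFs so an incomplete project is easy to spot later.
    (project_dir / "missing.txt").write_text("\n".join(missing), encoding="utf-8")
    return found, missing
```

An empty `missing.txt` then means the folder is complete for that word list; otherwise it names the PDFs that still have to be added by hand.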

Renegade
« Reply #12 on: April 26, 2012, 01:13:26 AM »
Re-reading, I think I misunderstood before. Are you simply looking to match chemical names, e.g. "chemical name", to the PDFs as in "chemical name.pdf"?
Here's a quick hack that will let you paste in a word list and check 1 folder: ChemicalPdfCheck.exe (9.5 KB)
But that's all it does. It's VERY simple. I don't have time right now to do much more, but here's what it is in case anyone wants to continue after you verify that it's the kind of thing you're looking for. (Maybe I can finish later.)
Here's the code for the 1 method to check (C#):

    // Needs "using System.IO;" at the top of the form's source file (for File and Path).
    private void button1_Click(object sender, EventArgs e)
    {
        DialogResult dr = folderBrowserDialog1.ShowDialog();
        if (dr != System.Windows.Forms.DialogResult.OK)
        {
            return;
        }

        // display the selected path
        textBox2.Text = folderBrowserDialog1.SelectedPath;

        // find the files that are missing
        foreach (string s in textBox1.Lines)
        {
            // skip blank lines so they are not reported as missing files
            if (s.Trim().Length == 0)
            {
                continue;
            }

            // build "<folder>\<name>.pdf" without hand-written separators
            string pdfPath = Path.Combine(folderBrowserDialog1.SelectedPath, s + ".pdf");
            if (!File.Exists(pdfPath))
            {
                // got a missing file - add it to our list of missing ones
                textBox3.Text += s + "\r\n";
            }
        }
    }

kalos
« Reply #13 on: April 26, 2012, 05:33:16 AM »
much appreciated, thanks!

Renegade
« Reply #14 on: April 26, 2012, 07:06:45 AM »
Quote from kalos:
much appreciated, thanks!

So is that, in part, what you're looking for?
I should mention that it is not optimized for large lists, e.g. if you have 50,000 or so in a list, then it's going to be sloooowwww...
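
One possible way to keep the check cheap for long lists, sketched in Python rather than C#, is to read the folder once into a set and test every name against it, collecting the results in a list instead of appending to a text box one entry at a time. The function name `missing_pdfs` and the case-insensitive comparison are assumptions, and this is not a measured fix for the tool above.

```python
import os


def missing_pdfs(names, folder):
    """Return the names that have no "<name>.pdf" in folder, using one directory scan."""
    # A single os.scandir pass replaces one file-system call per name.
    with os.scandir(folder) as entries:
        present = {e.name.lower() for e in entries if e.is_file()}
    # Lower-casing mimics Windows' case-insensitive file names (an assumption).
    return [
        n for n in names
        if n.strip() and f"{n.strip()}.pdf".lower() not in present
    ]
```

Set membership tests are constant time on average, so the cost grows roughly with the number of files plus the number of names rather than with their product.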

kalos
« Reply #15 on: April 26, 2012, 10:24:34 AM »
Quote from Renegade:
So is that what in part what you're looking for?

This is one part of the whole procedure.
Now, I think I am looking for a file type or some kind of program that will be able to display the info below. I imagine it as a table, but it can be anything.
- It will display an index/list of entries/fields:
1) chemical1
1.1) chemical2
1.2) chemical3
1.3) chemical4
2) chemical5
3) chemical6
etc.
- Each entry/field will have a specific title (e.g. chemical1) and "inside" it a specific file can be stored (pdf, txt, jpg, png files).
- If a field/entry of the index/list (e.g. 1.3) has no file stored, it will be marked red, which indicates that it's empty; otherwise it will be marked green, which means that it has a file stored in it.
- All the files above can be extracted into a folder and subfolders, according to the list/index.
I don't know if Excel can do this. I bet AHK can do this; if it is easy, I will try. But is there a ready-made solution?
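
The index described above could be modelled roughly as follows, using plain folders as the "container" for each entry. This is only a sketch under assumptions: the `N.N) title` line format is taken from the example list, while the function names `parse_outline` and `status` and the one-folder-per-entry layout are invented for illustration.

```python
import re
from pathlib import Path

# Matches lines such as "1.2) chemical3": an outline number, ")", then the title.
ENTRY = re.compile(r"^\s*(\d+(?:\.\d+)*)\)\s*(.+?)\s*$")


def parse_outline(text):
    """Parse lines like "1.2) chemical3" into (number, title, relative_folder) tuples."""
    titles = {}  # outline number -> title, used to look up parent entries
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue  # skip lines that are not outline entries, e.g. "etc"
        number, title = m.group(1), m.group(2)
        titles[number] = title
        parts = number.split(".")
        # The folder path follows the hierarchy: 1.2 lives under the folder of entry 1.
        prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
        chain = [titles.get(p, p) for p in prefixes]
        entries.append((number, title, Path(*chain)))
    return entries


def status(entries, root):
    """Map each outline number to "green" (folder holds a file) or "red" (empty or absent)."""
    result = {}
    for number, _title, rel in entries:
        folder = Path(root) / rel
        has_file = folder.is_dir() and any(p.is_file() for p in folder.iterdir())
        result[number] = "green" if has_file else "red"
    return result
```

With this layout the "extract into folders and subfolders" step is free, since the folders already mirror the index; a single-file container would need an extra export step.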

Renegade
« Reply #16 on: April 26, 2012, 10:30:44 AM »
Quote from kalos:
1) chemical1
1.1) chemical2
1.2) chemical3
1.3) chemical4
2) chemical5
3) chemical6

Where does that information come from?

kalos
« Reply #17 on: April 26, 2012, 11:10:13 AM »
Quote from Renegade:
Where does that information come from?

It is the word list, a text file.

Renegade
« Reply #18 on: April 26, 2012, 11:26:49 AM »
Quote from kalos:
It is the wordlist, a textfile

But why is there a hierarchy in the word file? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?

kalos
« Reply #19 on: April 26, 2012, 12:22:22 PM »
Quote from Renegade:
But why is there a hierarchy in the word file? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?

There is some hierarchy and some categorization, but I thought that, to begin with, a simple word list would suffice, and afterwards I would add the hierarchy, categories, etc.