automate database building

Renegade:
You mean to use EXCEL files?
That sounds good, and are there any similar kinds of files that can store blobs, etc.?
-kalos (April 24, 2012, 05:49 PM)
--- End quote ---

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes.

Can you post a PDF?
-Renegade (April 24, 2012, 08:26 PM)
--- End quote ---

but can't Excel store a PDF, or any other type of file, in a cell?
if not, which specific RDBMS would you suggest? (for a veeery simple task like this, one that doesn't require learning much)

as for the pdfs, an example is this:
http://www.purolite.com/RelId/606346/ISvars/default/customized/uploads/pdfs/Regen%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)
-kalos (April 25, 2012, 02:12 AM)
--- End quote ---

Well, yes, Excel *can* have a file embedded, but not in a cell (AFAIK).

For an RDBMS, just about any will do. Take your pick of whatever you like really.

For the PDF... Sorry. You're hosed. Completely hosed.

The data in the PDF is, at best, in the first normal form (and *maybe* the second normal form), i.e. it's all mixed up and jumbled so as to be useless as a database.

Now, it *is* possible to write some software to go through all the different cases and parse it all, but it is VERY far from being a quick task. Each individual case is pretty easy, but covering all of them is extremely time consuming.

The only solution I see that is remotely quick is to cut each line at the = sign then use the left side as a key and the right as a value. Keys can either be unique or not, but I would probably want to normalize things and try to force them to be unique.
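As a rough illustration of the cut-at-the-= idea, here is a minimal sketch. It assumes the text has already been extracted from the PDF into plain lines, which is the hard part and is not solved here. The class name `KeyValueLines` and the method `Parse` are made up for this example. It keeps duplicate keys as separate pairs, so forcing the keys to be unique is left as a later step.

```csharp
using System;
using System.Collections.Generic;

public static class KeyValueLines
{
    // Splits each line at the first '=' into a trimmed (key, value) pair.
    // Lines without '=' are skipped; duplicate keys are kept as separate pairs.
    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (string line in lines)
        {
            int cut = line.IndexOf('=');
            if (cut < 0)
            {
                continue; // no '=' on this line, so there is nothing to split
            }

            string key = line.Substring(0, cut).Trim();
            string value = line.Substring(cut + 1).Trim();
            pairs.Add(new KeyValuePair<string, string>(key, value));
        }
        return pairs;
    }
}
```

Only the first = on a line is treated as the separator, so anything after it (including further = signs) ends up in the value.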

Someone else may have a better idea.

As for extracting the data from the PDF, no clue. I loathe working with PDFs because they are a terminal format where once they are in that format, that's the end. Trying to get anything back out of them is a nightmare. You can try to save a PDF as a DOC to see just how terrifying the monstrosities that are produced are...

I couldn't save the file you had as a DOC because it has a <bad font>, whatever that means... (not even as HTML...)

I tried copying & pasting... No luck. Cells are collapsed to spaces, which effectively reduces any hope of the second normal form to the first normal form, i.e. useless.

It's better to go from database to PDF. That makes sense and is manageable. Going from PDF to anything is nigh hopeless.

kalos:
as for the pdfs, an example is this:
http://www.purolite.com/RelId/606346/ISvars/default/customized/uploads/pdfs/Regen%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)
-kalos (April 25, 2012, 02:12 AM)
--- End quote ---
Does that mean there is more than 1 chemical component described in 1 pdf file, or is this the kind of resulting pdf you want to have output eventually?

And how are the chemical components in the 'word list' separated from each other, by commas or each on a new line? Or just try to find a word as a pdf, if not there, add the next word and retry, etc.?
-Ath (April 25, 2012, 02:30 AM)
--- End quote ---

each pdf describes one chemical
each word of a wordlist is one chemical
each chemical corresponds to one pdf
each wordlist is a bunch of different chemicals

chemicals in the wordlist are separated by newlines

I can say it is easy to write a script that gathers into a folder the pdfs whose filenames are listed in a wordlist, one per line (and of course without the .pdf file extension written in the list)

but after creating many folders of pdfs, where each folder holds the pdfs of one wordlist, I need a way to know whether a folder contains all the pdfs named in the wordlist, or whether a pdf is missing (because a word in the wordlist doesn't have a matching pdf)
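One way the gathering step and the completeness check could be combined is sketched below. This is only an assumption about the workflow: all pdfs are taken to sit in one source folder, each wordlist is a plain text file with one chemical name per line, and the names `PdfGatherer` and `Gather` are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfGatherer
{
    // Copies "<name>.pdf" from sourceDir into targetDir for every name in the wordlist.
    // Returns the names for which no pdf was found in sourceDir.
    public static List<string> Gather(string wordListPath, string sourceDir, string targetDir)
    {
        Directory.CreateDirectory(targetDir); // does nothing if the folder already exists

        var missing = new List<string>();
        foreach (string rawLine in File.ReadLines(wordListPath))
        {
            string name = rawLine.Trim();
            if (name.Length == 0)
            {
                continue; // ignore blank lines in the wordlist
            }

            string source = Path.Combine(sourceDir, name + ".pdf");
            if (File.Exists(source))
            {
                // overwrite an older copy if the target already has one
                File.Copy(source, Path.Combine(targetDir, name + ".pdf"), true);
            }
            else
            {
                missing.Add(name); // word without a matching pdf
            }
        }
        return missing;
    }
}
```

If the returned list is empty, the folder holds every pdf named in that wordlist; otherwise the list says exactly which chemicals still need a pdf.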

Renegade:
Re-reading, I think I misunderstood before.

Are you simply looking to match chemical names, e.g. "chemical name", to the PDFs as in "chemical name.pdf"?

Here's a quick hack that will let you paste in a word list and check 1 folder:

ChemicalPdfCheck.exe (9.5 kB - downloaded 329 times.)

But that's all it does. It's VERY simple.

I don't have time right now to do much more, but here's what it is in case anyone wants to continue after you verify that it's the kind of thing you're looking for. (Maybe I can finish later.)

Here's the code for the 1 method to check:


--- Code: C# ---
private void button1_Click(object sender, EventArgs e)
{
    DialogResult dr = folderBrowserDialog1.ShowDialog();
    if (dr != System.Windows.Forms.DialogResult.OK)
    {
        return;
    }

    // display the selected path
    textBox2.Text = folderBrowserDialog1.SelectedPath;

    // find the files that are missing
    foreach (string s in textBox1.Lines)
    {
        if (File.Exists(folderBrowserDialog1.SelectedPath + @"\" + s + ".pdf"))
        {
            // do nothing
        }
        else
        {
            // got a missing file - add it to our list of missing ones
            textBox3.Text += s + "\r\n";
        }
    }
}

kalos:
much appreciated, thanks!

Renegade:
much appreciated, thanks!
-kalos (April 26, 2012, 05:33 AM)
--- End quote ---

So is that, in part, what you're looking for?

I should mention that it is not optimized for large lists, e.g. if you have 50,000 or so entries in a list, then it's going to be sloooowwww...
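For larger lists, one possible variation is sketched below. It has not been measured. The assumption is that the slow parts are the per-name `File.Exists` call and the repeated `textBox3.Text +=`, so it lists the folder once, keeps the names in a `HashSet`, and collects the missing names in a list. The names `PdfChecker` and `FindMissing` are made up for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfChecker
{
    // Returns the names from the list that have no "<name>.pdf" in the folder.
    public static List<string> FindMissing(IEnumerable<string> names, string folder)
    {
        // One directory listing, kept in a case-insensitive set
        // (Windows file names are not case sensitive).
        var present = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            // The "*.pdf" pattern can also match longer extensions on some
            // .NET versions, so check the extension explicitly.
            if (string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase))
            {
                present.Add(Path.GetFileNameWithoutExtension(path));
            }
        }

        var missing = new List<string>();
        foreach (string rawName in names)
        {
            string name = rawName.Trim();
            if (name.Length > 0 && !present.Contains(name))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

In the click handler, the result could then be shown with a single assignment such as `textBox3.Text = string.Join("\r\n", PdfChecker.FindMissing(textBox1.Lines, folderBrowserDialog1.SelectedPath));` so the text box is updated once instead of once per missing file.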
