Author Topic: automate database building  (Read 3520 times)

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
automate database building
« on: April 23, 2012, 11:14:43 AM »
hello!

I have a word list (different each time), where each word (or line, if you prefer) is the name of a chemical substance
I have a huge folder of PDF files with the specifications of every substance

what I do is gather the PDFs of the substances in the list, to make a bunch of PDF files that corresponds to the substances of that specific wordlist

but I need to automate that process, so can you please suggest:

1) how I can automatically bind the wordlist to the corresponding PDF files
2) how I can create a single file (not sure what kind of file) that will contain all these related files (compressed files such as rar, zip, etc. would not be convenient, because they take time to open)
3) how I can check at any point whether a specific project (e.g. the gathering of the PDFs of a wordlist) is incomplete and exactly which files are missing (so I can add them manually)

can you suggest an idea, a procedure, an Excel file to build, an AHK script to build, a program, anything!

thanks!

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #1 on: April 24, 2012, 01:40:06 PM »
any suggestion???

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 2,532
Re: automate database building
« Reply #2 on: April 24, 2012, 02:24:47 PM »
Probably needs more explanation, because I can't comprehend, even after reading it 3 times in a row :-[

rjbull

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 2,853
Re: automate database building
« Reply #3 on: April 24, 2012, 03:04:16 PM »
Ath's right, needs clarification.  Maybe an example?

Are your PDFs named according to the chemical itself, i.e. the same name as in the word list?

You can (usually) concatenate PDFs from the command line using pdftk.  So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them.
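
A minimal sketch of that idea in Python rather than a batch file, assuming one chemical name per line and PDFs named exactly like the words. The `specs` folder, the `bundle.pdf` output name and "Sodium Hydroxide" are made up for illustration; `cat ... output` is pdftk's usual concatenation syntax. The sketch only builds the command and does not run pdftk.

```python
from pathlib import Path


def read_word_list(text):
    """One chemical name per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_pdftk_command(words, pdf_dir, output):
    """Return the argument list for: pdftk <inputs...> cat output <output>."""
    inputs = [str(Path(pdf_dir) / f"{word}.pdf") for word in words]
    return ["pdftk", *inputs, "cat", "output", str(output)]


# Hypothetical word list and folder names, for illustration only.
words = read_word_list("Hydrochloric Muriatic Acid\n\nSodium Hydroxide\n")
cmd = build_pdftk_command(words, "specs", "bundle.pdf")
print(cmd)
# To actually run it (needs pdftk installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with names that contain spaces.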

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #4 on: April 24, 2012, 04:03:03 PM »
Are your PDFs named according to the chemical itself, i.e. the same name as in the word list?

yes

You can (usually) concatenate PDFs from the command line using pdftk.  So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them.

I do not want to combine the pdfs and make one pdf
I want to somehow attach them to the word list :/

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #5 on: April 24, 2012, 05:07:26 PM »
The only thing I can think of is building some custom software/script to parse the text and create the database like that.

But without knowing much more, that's about all I can say.

For a single file, maybe store the PDFs in the DB as a blob in a single table?
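
A rough sketch of the blob idea, assuming SQLite (a single-file database that ships with Python) is acceptable. The table name `specs`, its columns and the placeholder bytes are invented for illustration and are not something from this thread.

```python
import sqlite3


def open_store(db_path):
    """Open (or create) a single-file database with one table of PDFs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS specs ("
        " name TEXT PRIMARY KEY,"
        " pdf  BLOB NOT NULL)"
    )
    return conn


def put_pdf(conn, name, data):
    """Store (or overwrite) the PDF bytes for one chemical name."""
    conn.execute(
        "INSERT OR REPLACE INTO specs (name, pdf) VALUES (?, ?)", (name, data)
    )
    conn.commit()


def get_pdf(conn, name):
    """Return the stored bytes, or None if that chemical has no PDF yet."""
    row = conn.execute("SELECT pdf FROM specs WHERE name = ?", (name,)).fetchone()
    return None if row is None else bytes(row[0])


# ":memory:" keeps the demo off disk; a real store would use a file path.
conn = open_store(":memory:")
put_pdf(conn, "Hydrochloric Muriatic Acid", b"%PDF-1.4 placeholder bytes")
print(get_pdf(conn, "Hydrochloric Muriatic Acid")[:8])
print(get_pdf(conn, "Not In Store"))
```

With a real file path instead of `:memory:`, the whole collection lives in one file, and a `None` result doubles as a "this one is missing" check.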
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #6 on: April 24, 2012, 05:49:56 PM »
You mean use Excel files?
That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #7 on: April 24, 2012, 08:26:23 PM »
You mean use Excel files?
That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes.

Can you post a PDF?

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #8 on: April 25, 2012, 02:12:04 AM »
You mean use Excel files?
That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes.

Can you post a PDF?

but can't Excel store a PDF, or any other type of file, in a cell?
if not, which specific RDBMS would you suggest? (for a veeery simple task like this, one that doesn't require learning much)

as for the pdfs, an example is this:
http://www.purolite....n%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 2,532
Re: automate database building
« Reply #9 on: April 25, 2012, 02:30:59 AM »
as for the pdfs, an example is this:
http://www.purolite....n%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)
Does that mean there is more than 1 chemical component described in 1 pdf file, or is this the kind of resulting pdf you want to have as output eventually?

And how are the chemical components in the 'word list' separated from each other, by commas or each on a new line? Or just try to find a word as a pdf, and if it's not there, add the next word and retry, etc.?

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #10 on: April 25, 2012, 02:41:21 AM »
You mean use Excel files?
That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes.

Can you post a PDF?

but can't Excel store a PDF, or any other type of file, in a cell?
if not, which specific RDBMS would you suggest? (for a veeery simple task like this, one that doesn't require learning much)

as for the pdfs, an example is this:
http://www.purolite....n%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)

Well, yes, Excel *can* have a file embedded, but not in a cell (AFAIK).

For an RDBMS, just about any will do. Take your pick of whatever you like really.

For the PDF... Sorry. You're hosed. Completely hosed.

The data in the PDF is in the first normal form (and *maybe* the second normal form). i.e. It's all mixed up and jumbled so as to be useless as a database.

Now, it *is* possible to write some software to go through all the different cases and to parse it all, but it is VERY far from being a trivial/easy task. Actually, it's pretty easy, but it is extremely time consuming.

The only solution I see that is remotely quick is to cut each line at the = sign then use the left side as a key and the right as a value. Keys can either be unique or not, but I would probably want to normalize things and try to force them to be unique.
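
As a rough sketch of that cut-at-the-equals-sign idea (Python here, purely as an illustration): each line is split at the first `=`, the left side is normalized into a key, and repeated keys keep all their values so duplicates stay visible. The sample lines are made up and are not taken from the PDF.

```python
def parse_key_values(lines):
    """Split each line at the first '=' into key (left) and value (right)."""
    pairs = {}
    for line in lines:
        if "=" not in line:
            continue  # no '=' sign, so nothing to cut
        key, value = line.split("=", 1)
        # Normalize keys: collapse runs of whitespace and ignore case.
        key = " ".join(key.split()).lower()
        if not key:
            continue
        pairs.setdefault(key, []).append(value.strip())
    return pairs


# Made-up sample lines, for illustration only.
sample = [
    "Appearance = Clear liquid",
    "Specific Gravity = 1.18",
    "specific   gravity = 1.19",
    "page 1 of 2",
]
for key, values in parse_key_values(sample).items():
    print(key, "->", values)
```

A key that ends up with more than one value is a candidate for manual clean-up if the keys are meant to be unique.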

Someone else may have a better idea.

As for extracting the data from the PDF, no clue. I loathe working with PDFs because they are a terminal format where once they are in that format, that's the end. Trying to get anything back out of them is a nightmare. You can try to save a PDF as a DOC to see just how terrifying the monstrosities that are produced are...

I couldn't save the file you had as a DOC because it has a <bad font>, whatever that means... (not even as HTML...)

I tried copying & pasting... No luck. Cells are collapsed to spaces, which effectively reduces any hope of the second normal form to the first normal form, i.e. useless.

It's better to go from database to PDF. That makes sense and is manageable. Going from PDF to anything is nigh hopeless.

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #11 on: April 25, 2012, 12:53:56 PM »
as for the pdfs, an example is this:
http://www.purolite....n%20Chem%20Specs.pdf
(renamed as Hydrochloric Muriatic Acid.pdf)
Does that mean there is more than 1 chemical component described in 1 pdf file, or is this the kind of resulting pdf you want to have as output eventually?

And how are the chemical components in the 'word list' separated from each other, by commas or each on a new line? Or just try to find a word as a pdf, and if it's not there, add the next word and retry, etc.?

each pdf describes one chemical
each word of a wordlist is one chemical
each chemical corresponds to one pdf
each wordlist is a bunch of different chemicals

chemicals in the wordlist are separated by newlines

I'd say it is easy to write a script that gathers into a folder the PDFs whose filenames are mentioned in a wordlist, one per line (and of course without the .pdf file extension written in the list)

but after creating many folders of PDFs, where each folder gathers the PDFs of one wordlist, I need a way to know whether a folder contains all the PDFs mentioned in the wordlist, or whether a PDF is missing (because a word in the wordlist doesn't have a matching PDF)
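
A minimal sketch of such a gathering script in Python, assuming the wordlist has one name per line without the .pdf extension. The folder names `all_specs` and `project1` and the chemical "Sodium Hydroxide" are hypothetical, and the demo runs in a temporary directory. It copies what it finds and returns the names that had no matching PDF.

```python
import shutil
import tempfile
from pathlib import Path


def gather_pdfs(words, source_dir, target_dir):
    """Copy '<word>.pdf' for each word into target_dir; return missing words."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for line in words:
        name = line.strip()
        if not name:
            continue  # skip blank lines in the wordlist
        src = source_dir / f"{name}.pdf"
        if src.is_file():
            shutil.copy2(src, target_dir / src.name)
        else:
            missing.append(name)
    return missing


# Demo in a throwaway directory with one placeholder PDF.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "all_specs")
    source.mkdir()
    (source / "Hydrochloric Muriatic Acid.pdf").write_bytes(b"%PDF-1.4")
    wordlist = "Hydrochloric Muriatic Acid\nSodium Hydroxide\n"
    missing = gather_pdfs(wordlist.splitlines(), source, Path(tmp, "project1"))
    copied = sorted(p.name for p in Path(tmp, "project1").iterdir())

print("copied:", copied)
print("missing:", missing)
```

Writing the returned list to a text file next to the gathered PDFs would leave a record of what still has to be added by hand.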


Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #12 on: April 26, 2012, 01:13:26 AM »
Re-reading, I think I misunderstood before.

Are you simply looking to match chemical names, e.g. "chemical name", to the PDFs as in "chemical name.pdf"?

Here's a quick hack that will let you paste in a word list and check 1 folder:

* ChemicalPdfCheck.exe (9.5 kB - downloaded 76 times.)

But that's all it does. It's VERY simple.

I don't have time right now to do much more, but here's what it is in case anyone wants to continue after you verify that it's the kind of thing you're looking for. (Maybe I can finish later.)

Here's the code for the 1 method to check:

Code: C#
        // Needs: using System; using System.IO; using System.Windows.Forms;
        private void button1_Click(object sender, EventArgs e)
        {
            // let the user pick the folder to check; stop if they cancel
            DialogResult dr = folderBrowserDialog1.ShowDialog();
            if (dr != System.Windows.Forms.DialogResult.OK)
            {
                return;
            }
            string folder = folderBrowserDialog1.SelectedPath;

            // display the selected path
            textBox2.Text = folder;

            // clear the results of any previous check
            textBox3.Clear();

            // find the files that are missing
            foreach (string line in textBox1.Lines)
            {
                string name = line.Trim();
                if (name.Length == 0)
                {
                    continue; // skip blank lines in the pasted word list
                }
                if (!File.Exists(Path.Combine(folder, name + ".pdf")))
                {
                    // got a missing file - add it to our list of missing ones
                    textBox3.AppendText(name + Environment.NewLine);
                }
            }
        }


kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #13 on: April 26, 2012, 05:33:16 AM »
much appreciated, thanks!

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #14 on: April 26, 2012, 07:06:45 AM »
much appreciated, thanks!

So is that, in part, what you're looking for?

I should mention, that is not optimized for large lists, e.g. if you have 50,000 or so in a list, then it's going to be sloooowwww...
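
One possible way around that, sketched in Python rather than the C# above: list the folder once, keep the PDF names in a set, and look each word up in the set instead of asking the file system about every word. Whether the per-word file check is really the slow part is an assumption. The comparison ignores case, as Windows file names do, and the demo file names are made up.

```python
import os
import tempfile


def missing_names(words, folder):
    """Return the words that have no '<word>.pdf' in folder (case-insensitive)."""
    with os.scandir(folder) as entries:
        present = {
            os.path.splitext(entry.name)[0].casefold()
            for entry in entries
            if entry.is_file() and entry.name.lower().endswith(".pdf")
        }
    cleaned = (word.strip() for word in words)
    return [word for word in cleaned if word and word.casefold() not in present]


# Demo with made-up file names in a throwaway folder.
with tempfile.TemporaryDirectory() as folder:
    for filename in ("Alpha.pdf", "beta.PDF", "notes.txt"):
        open(os.path.join(folder, filename), "wb").close()
    result = missing_names(["alpha", "Beta", "", "gamma", "notes"], folder)

print(result)
```

The folder is read once however long the list is, so the cost grows with the list length plus the folder size rather than with one file-system call per word.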

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #15 on: April 26, 2012, 10:24:34 AM »
much appreciated, thanks!
So is that, in part, what you're looking for?
this is a part of the whole procedure

Now, I think I am looking for a filetype or some kind of program that will be able to display the below info. I imagine it as a table, but it can be anything.
- it will display an index/list of entries/fields:
1) chemical1
  1.1) chemical2
  1.2) chemical3
  1.3) chemical4
2) chemical5
3) chemical6
etc
- each entry/field will have a specific title (e.g. chemical1) and a specific file can be stored "inside" it (pdf, txt, jpg, png files)
- if a field/entry of the index/list (e.g. 1.3) has no file stored, it will be marked red to indicate that it's empty; otherwise it will be marked green, which means it has a file stored in it
- all of the files above can be extracted into a folder and subfolders, following the list/index

I don't know if Excel can do this. I bet AHK can, and if it's easy I will try, but is there a ready-made solution?
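
Not a ready-made solution, but the red/green part of the list above could be sketched roughly like this in Python, assuming each entry's file sits in one folder and is named after the entry's title. The file names in the demo are invented, and "red"/"green" are plain text labels here, not colours.

```python
import tempfile
from pathlib import Path

# File types mentioned in the list above.
EXTENSIONS = (".pdf", ".txt", ".jpg", ".png")


def entry_status(titles, folder):
    """Return (title, 'green' or 'red', file name or None) for each title."""
    by_stem = {}
    for path in Path(folder).iterdir():
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            by_stem.setdefault(path.stem.casefold(), path.name)
    report = []
    for title in titles:
        found = by_stem.get(title.casefold())
        report.append((title, "green" if found else "red", found))
    return report


# Demo: two of the three entries have a file stored.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "chemical1.pdf").write_bytes(b"%PDF-1.4")
    Path(folder, "chemical3.png").write_bytes(b"")
    report = entry_status(["chemical1", "chemical2", "chemical3"], folder)

for title, colour, filename in report:
    print(f"{title:12} {colour:6} {filename or '-'}")
```

The same report could be written to a CSV file and coloured with conditional formatting in Excel.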

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #16 on: April 26, 2012, 10:30:44 AM »

1) chemical1
  1.1) chemical2
  1.2) chemical3
  1.3) chemical4
2) chemical5
3) chemical6

Where does that information come from?

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #17 on: April 26, 2012, 11:10:13 AM »

1) chemical1
  1.1) chemical2
  1.2) chemical3
  1.3) chemical4
2) chemical5
3) chemical6

Where does that information come from?

It is the wordlist, a textfile

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 12,785
  • Tell me something you don't know...
Re: automate database building
« Reply #18 on: April 26, 2012, 11:26:49 AM »
It is the wordlist, a textfile

But why is there a hierarchy in the word file? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?

kalos

  • Member
  • Joined in 2006
  • **
  • Posts: 1,369
Re: automate database building
« Reply #19 on: April 26, 2012, 12:22:22 PM »
It is the wordlist, a textfile

But why is there a hierarchy in the word file? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?


there is some hierarchy and some categorization, but I thought that, to begin with, a simple wordlist would suffice, and afterwards I would add the hierarchy and categories, etc.
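
If the hierarchy does get added later, one way it might map to folders and subfolders is sketched below in Python. It assumes the wordlist uses the "1.1) name" numbering from the example earlier in the thread. That format is a guess, since the real file layout was never specified.

```python
import re
from pathlib import PurePosixPath

# Matches lines such as "1) chemical1" or "  1.3) chemical4".
ENTRY = re.compile(r"^\s*(\d+(?:\.\d+)*)\)\s*(.+?)\s*$")


def outline_to_paths(lines):
    """Turn numbered outline lines into folder paths, parents first."""
    titles = {}  # outline number such as "1.2" -> title
    paths = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # not an outline entry (e.g. "etc")
        number, title = match.groups()
        titles[number] = title
        parts = number.split(".")
        # Collect the titles of every ancestor ("1", then "1.2", ...).
        prefixes = (".".join(parts[:i]) for i in range(1, len(parts) + 1))
        chain = [titles[prefix] for prefix in prefixes if prefix in titles]
        paths.append(PurePosixPath(*chain))
    return paths


outline = """1) chemical1
  1.1) chemical2
  1.2) chemical3
  1.3) chemical4
2) chemical5
3) chemical6
etc"""
for path in outline_to_paths(outline.splitlines()):
    print(path)
```

Each resulting path could be used as the subfolder (or file stem) for that entry when the files are extracted.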