Author Topic: automate database building  (Read 2963 times)
kalos
Member
**
Posts: 999

« on: April 23, 2012, 11:14:43 AM »

Hello!

I have a word list (a different one each time), where each word (or line, if you prefer) is the name of a chemical substance.
I also have a huge folder of PDF files containing the specifications of every substance.

What I do is gather the PDFs for the specific substances in the list, so that I end up with a set of PDF files that matches the substances in that word list.

I need to automate that process, so can you please suggest:

1) How can I automatically match the word list to the corresponding PDF files?
2) How can I create a single file (not sure what kind of file) that contains all of these related files? (Archives such as rar or zip would not be convenient, because they take time to open.)
3) How can I check at any point whether a specific project (e.g. the collected PDFs for one word list) is incomplete, and exactly which files are missing (so I can add them manually)?

Can you suggest an idea, a procedure, an Excel file to build, an AHK script to write, a program, anything?

Thanks!

kalos
« Reply #1 on: April 24, 2012, 01:40:06 PM »

Any suggestions?
Ath
Supporting Member
**
Posts: 2,198



« Reply #2 on: April 24, 2012, 02:24:47 PM »

This probably needs more explanation, because I can't comprehend it, even after reading it 3 times in a row.

rjbull
Charter Member
***
Posts: 2,748

« Reply #3 on: April 24, 2012, 03:04:16 PM »

Ath's right, this needs clarification.  Maybe an example?

Are your PDFs named according to the chemical itself, i.e. the same name as in the word list?

You can (usually) concatenate PDFs from the command line using pdftk.  So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them.
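
As a rough illustration of that batch-file idea, here is a minimal C# sketch that turns a word list into a single pdftk command line. It assumes the PDFs sit in the current folder and are named exactly like the list entries; the class name `PdftkCommandBuilder` and the output name in the usage note are made up for the example. Only the `cat output` operation is pdftk's own syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PdftkCommandBuilder
{
    // Builds one pdftk command line that concatenates "<name>.pdf" for every
    // non-blank entry of the word list, in list order.
    public static string Build(IEnumerable<string> wordList, string outputPdf)
    {
        var inputs = wordList
            .Select(w => w.Trim())
            .Where(w => w.Length > 0)
            .Select(w => "\"" + w + ".pdf\"");   // quote names that contain spaces

        return "pdftk " + string.Join(" ", inputs) + " cat output \"" + outputPdf + "\"";
    }
}
```

Calling `PdftkCommandBuilder.Build(File.ReadLines("wordlist.txt"), "combined.pdf")` would give a line that can be pasted into a batch file. A very long word list may run into the command-line length limit, so this is only practical for modest lists.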

kalos
« Reply #4 on: April 24, 2012, 04:03:03 PM »

Quote from rjbull: "Are your PDFs named according to the chemical itself, i.e. the same name as in the word list?"

Yes.

Quote from rjbull: "You can (usually) concatenate PDFs from the command line using pdftk.  So if your PDFs are named as in your word list, you could use the word list to build a batch file to concatenate them."

I do not want to combine the PDFs into one PDF.
I want to somehow attach them to the word list :/
Renegade
Charter Member
***
Posts: 11,163



Tell me something you don't know...

« Reply #5 on: April 24, 2012, 05:07:26 PM »

The only thing I can think of is building some custom software/script to parse the text and create the database like that.

But without knowing much more, that's about all I can say.

For a single file, maybe store the PDFs in the DB as a blob in a single table?

Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker
kalos
« Reply #6 on: April 24, 2012, 05:49:56 PM »

Do you mean using Excel files?
That sounds good. Are there any similar kinds of files that can store blobs, etc.?

Renegade
« Reply #7 on: April 24, 2012, 08:26:23 PM »

Quote from kalos: "Do you mean using Excel files? That sounds good. Are there any similar kinds of files that can store blobs, etc.?"

Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes.

Can you post a PDF?

kalos
« Reply #8 on: April 25, 2012, 02:12:04 AM »

Quote from Renegade: "Excel doesn't use data types like that, so you can't store a blob in it. Blobs are for RDBMSes. Can you post a PDF?"

But can't Excel store a PDF, or any other type of file, in a cell?
If not, which specific RDBMS would you suggest? (For a veeery simple task like this, one that doesn't require much learning.)

As for the PDFs, here is an example:
http://www.purolite.com/R.../Regen%20Chem%20Specs.pdf
(renamed to Hydrochloric Muriatic Acid.pdf)

Ath
« Reply #9 on: April 25, 2012, 02:30:59 AM »

Quote from kalos: "As for the PDFs, here is an example: http://www.purolite.com/R.../Regen%20Chem%20Specs.pdf (renamed to Hydrochloric Muriatic Acid.pdf)"

Does that mean more than one chemical component is described in one PDF file, or is this the kind of PDF you eventually want as output?

And how are the chemical components in the 'word list' separated from each other: by commas, or each on a new line? Or should a program just try to find a word as a PDF and, if it isn't there, add the next word and retry, etc.?

Renegade
« Reply #10 on: April 25, 2012, 02:41:21 AM »

Quote from kalos: "But can't Excel store a PDF, or any other type of file, in a cell? If not, which specific RDBMS would you suggest? (For a veeery simple task like this, one that doesn't require much learning.) As for the PDFs, here is an example: http://www.purolite.com/R.../Regen%20Chem%20Specs.pdf (renamed to Hydrochloric Muriatic Acid.pdf)"

Well, yes, Excel *can* have a file embedded, but not in a cell (AFAIK).

For an RDBMS, just about any will do. Take your pick of whatever you like really.

For the PDF... Sorry. You're hosed. Completely hosed.

The data in the PDF is laid out for reading, not as clean rows and columns (at best you could call it first normal form), i.e. it's all mixed up and jumbled so as to be useless as a database.

Now, it *is* possible to write some software to go through all the different cases and parse it all, but it is VERY far from being a quick task. Each individual case is pretty easy, but covering them all is extremely time consuming.

The only solution I see that is remotely quick is to cut each line at the = sign then use the left side as a key and the right as a value. Keys can either be unique or not, but I would probably want to normalize things and try to force them to be unique.
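
For illustration only, cutting lines at the = sign could look like the C# sketch below. It assumes the text has already been pulled out of the PDF somehow (which is the hard part) and that each useful line contains an = sign. The class name `KeyValueLines`, the trimming, and the case-insensitive "first value wins" rule for duplicate keys are assumptions made for the example, and the sample lines in the usage note are invented.

```csharp
using System;
using System.Collections.Generic;

public static class KeyValueLines
{
    // Splits each line at its first '=' sign: the left side becomes the key,
    // the right side the value. Lines without '=' are skipped, and a repeated
    // key (compared case-insensitively) keeps its first value.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            int eq = line.IndexOf('=');
            if (eq < 0) continue;                       // no separator on this line

            string key = line.Substring(0, eq).Trim();
            string value = line.Substring(eq + 1).Trim();
            if (key.Length > 0 && !result.ContainsKey(key))
            {
                result.Add(key, value);
            }
        }
        return result;
    }
}
```

For example, `KeyValueLines.Parse(new[] { "Formula = HCl", "Density=1.18 g/mL" })` yields two entries, with `"Formula"` mapped to `"HCl"`. If duplicate keys should be kept rather than dropped, a `Dictionary<string, List<string>>` would be the obvious variation.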

Someone else may have a better idea.

As for extracting the data from the PDF, no clue. I loathe working with PDFs because they are a terminal format: once something is in that format, that's the end. Trying to get anything back out of them is a nightmare. You can try saving a PDF as a DOC to see just how terrifying the resulting monstrosities are...

I couldn't save the file you posted as a DOC because it has a <bad font>, whatever that means... (not even as HTML...)

I tried copying & pasting... No luck. Cells are collapsed to spaces, so the table structure is lost and what's left is useless as structured data.

It's better to go from database to PDF. That makes sense and is manageable. Going from PDF to anything is nigh hopeless.

kalos
« Reply #11 on: April 25, 2012, 12:53:56 PM »

Quote from Ath: "Does that mean more than one chemical component is described in one PDF file, or is this the kind of PDF you eventually want as output? And how are the chemical components in the 'word list' separated from each other: by commas, or each on a new line?"

Each PDF describes one chemical.
Each word in a word list is one chemical.
Each chemical corresponds to one PDF.
Each word list is a bunch of different chemicals.

Chemicals in the word list are separated by newlines.

I'd say it is easy to write a script that gathers into a folder the PDFs whose file names are mentioned in a word list, one name per line (and of course without the .pdf file extension written in the list).

But after creating many folders of PDFs, where each folder holds the PDFs for one word list, I need a way to know whether a folder contains all the PDFs mentioned in its word list, or whether a PDF is missing (because a word in the word list doesn't have a matching PDF).
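
A minimal C# sketch of that gather-and-check step might look like this. It assumes one chemical name per line, PDFs named `<name>.pdf` in a single source folder, and one project folder per word list; the class and parameter names are invented for the example.

```csharp
using System.Collections.Generic;
using System.IO;

public static class PdfGatherer
{
    // Copies "<name>.pdf" from sourceFolder into projectFolder for every entry
    // of the word list and returns the entries that had no matching PDF.
    public static List<string> Gather(string wordListPath, string sourceFolder, string projectFolder)
    {
        Directory.CreateDirectory(projectFolder);      // no-op if it already exists
        var missing = new List<string>();

        foreach (string line in File.ReadLines(wordListPath))
        {
            string name = line.Trim();
            if (name.Length == 0) continue;            // skip blank lines

            string source = Path.Combine(sourceFolder, name + ".pdf");
            if (File.Exists(source))
            {
                File.Copy(source, Path.Combine(projectFolder, name + ".pdf"), true);
            }
            else
            {
                missing.Add(name);                     // report it instead of failing
            }
        }
        return missing;
    }
}
```

Running it again after adding PDFs by hand to the source folder doubles as the completeness check: an empty result means the project folder has everything the word list asks for.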


Renegade
« Reply #12 on: April 26, 2012, 01:13:26 AM »

Re-reading, I think I misunderstood before.

Are you simply looking to match chemical names, e.g. "chemical name", to the PDFs as in "chemical name.pdf"?

Here's a quick hack that will let you paste in a word list and check 1 folder:

Attachment: ChemicalPdfCheck.exe (9.5 KB)

But that's all it does. It's VERY simple.

I don't have time right now to do much more, but here's what it is in case anyone wants to continue after you verify that it's the kind of thing you're looking for. (Maybe I can finish later.)

Here's the code for the 1 method to check:

        // Needs "using System.IO;" and "using System.Text;" at the top of the form's file.
        private void button1_Click(object sender, EventArgs e)
        {
            // let the user pick the folder to check; do nothing if they cancel
            if (folderBrowserDialog1.ShowDialog() != DialogResult.OK)
            {
                return;
            }

            // display the selected path
            string folder = folderBrowserDialog1.SelectedPath;
            textBox2.Text = folder;

            // find the word list entries that have no matching "<name>.pdf"
            var missing = new StringBuilder();
            foreach (string s in textBox1.Lines)
            {
                string name = s.Trim();
                if (name.Length == 0)
                {
                    continue; // skip blank lines in the word list
                }

                if (!File.Exists(Path.Combine(folder, name + ".pdf")))
                {
                    // got a missing file - add it to our list of missing ones
                    missing.AppendLine(name);
                }
            }

            // replace (rather than append to) the results of any earlier check
            textBox3.Text = missing.ToString();
        }


kalos
« Reply #13 on: April 26, 2012, 05:33:16 AM »

much appreciated, thanks!

Renegade
« Reply #14 on: April 26, 2012, 07:06:45 AM »

Quote from kalos: "much appreciated, thanks!"

So is that, at least in part, what you're looking for?

I should mention that it is not optimized for large lists, e.g. if you have 50,000 or so entries in a list, then it's going to be sloooowwww...
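
One possible way to speed up the check for big lists, sketched in C# under a few assumptions: the folder is read once into a set instead of being queried once per entry, and names are compared case-insensitively the way Windows file names are. The class name `FastPdfCheck` is invented here and this is not the code in the attached program.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FastPdfCheck
{
    // Returns the word list entries that have no "<name>.pdf" in the folder.
    public static List<string> FindMissing(IEnumerable<string> wordList, string folder)
    {
        // Read the directory a single time and remember the PDF base names.
        var present = new HashSet<string>(
            Directory.EnumerateFiles(folder)
                .Where(f => Path.GetExtension(f).Equals(".pdf", StringComparison.OrdinalIgnoreCase))
                .Select(f => Path.GetFileNameWithoutExtension(f)),
            StringComparer.OrdinalIgnoreCase);

        // Each lookup is now an in-memory hash check instead of a disk query.
        return wordList
            .Select(w => w.Trim())
            .Where(w => w.Length > 0 && !present.Contains(w))
            .ToList();
    }
}
```

In a form like the one above, the result could be shown in one go with `textBox3.Lines = missing.ToArray();`, which also avoids rebuilding the text box contents for every missing entry.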

kalos
« Reply #15 on: April 26, 2012, 10:24:34 AM »

Quote from Renegade: "So is that, at least in part, what you're looking for?"

This is one part of the whole procedure.

Now, I think I am looking for a file type or some kind of program that can display the information below. I imagine it as a table, but it can be anything.
- It will display an index/list of entries/fields:
1) chemical1
  1.1) chemical2
  1.2) chemical3
  1.3) chemical4
2) chemical5
3) chemical6
etc.
- Each entry/field will have a specific title (e.g. chemical1), and a specific file can be stored "inside" it (PDF, TXT, JPG or PNG files).
- If an entry/field of the index/list (e.g. 1.3) has no file stored in it, it will be marked red to indicate that it's empty; otherwise it will be marked green, which means that it has a file stored in it.
- It will be possible to extract all of the files above into a folder and subfolders that follow the list/index.
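
A rough C# sketch of the red/green part of such an index is below. The format of the hierarchy has not been specified in this thread, so the sketch simply assumes an outline file with one title per line and two leading spaces per nesting level, plus a single folder holding the files. The names `IndexEntry` and `ProjectIndex` are invented, and extraction into subfolders is not covered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public sealed class IndexEntry
{
    public string Title;
    public int Depth;      // 0 = top level, 1 = first sub-level, ...
    public bool HasFile;   // true = "green", false = "red"
}

public static class ProjectIndex
{
    static readonly string[] Extensions = { ".pdf", ".txt", ".jpg", ".png" };

    public static List<IndexEntry> Build(IEnumerable<string> outlineLines, string folder)
    {
        var entries = new List<IndexEntry>();
        foreach (string line in outlineLines)
        {
            string title = line.Trim();
            if (title.Length == 0) continue;

            int indent = line.Length - line.TrimStart().Length;
            entries.Add(new IndexEntry
            {
                Title = title,
                Depth = indent / 2,   // assumption: two spaces per level
                HasFile = Extensions.Any(ext => File.Exists(Path.Combine(folder, title + ext)))
            });
        }
        return entries;
    }

    // Plain-text rendering; a GUI could colour the rows instead.
    public static string Render(IEnumerable<IndexEntry> entries)
    {
        return string.Join(Environment.NewLine, entries.Select(e =>
            new string(' ', e.Depth * 2) + (e.HasFile ? "[green] " : "[red] ") + e.Title));
    }
}
```

`ProjectIndex.Render(ProjectIndex.Build(File.ReadLines("outline.txt"), folder))` would then print the outline with a marker in front of each title. The same `Depth` values could later drive the creation of subfolders.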

I don't know if Excel can do this. I bet AHK can, and if it is easy I will try, but is there a ready-made solution?

Renegade
« Reply #16 on: April 26, 2012, 10:30:44 AM »


Quote from kalos:
  1) chemical1
    1.1) chemical2
    1.2) chemical3
    1.3) chemical4
  2) chemical5
  3) chemical6

Where does that information come from?

kalos
« Reply #17 on: April 26, 2012, 11:10:13 AM »


Quote from Renegade: "Where does that information come from?"

It is the word list, a text file.

Renegade
« Reply #18 on: April 26, 2012, 11:26:49 AM »

Quote from kalos: "It is the word list, a text file."

But why is there a hierarchy in the word list? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?

kalos
« Reply #19 on: April 26, 2012, 12:22:22 PM »

Quote from Renegade: "But why is there a hierarchy in the word list? Isn't it 1 entry per line? Are they indented with a tab or 4 spaces or something? What's the logic there for how it applies to the PDFs? Or is the 1.1, 1.2 stuff irrelevant?"

There is some hierarchy and some categorization, but I thought that a simple word list would suffice to begin with, and that I would add the hierarchy, categories, etc. afterwards.