Author Topic: extract text from pdf  (Read 4785 times)

kalos

extract text from pdf
« on: July 16, 2009, 11:14:36 AM »
hello

Can you please help me extract the text from this PDF file: http://stashbox.org/571279/mydocument.pdf
1) I would like each entry to appear on a single line (by "line" I mean the text between the "beginning of line" symbol and the "end of line" symbol).
For some reason, none of the PDF editors I tried can tell where a line should end, so they break the lines in the wrong places.
2) I don't want the columns to appear (no PDF editor has an option to ignore them), nor the page numbers at the end of the pages, nor the arrows at the beginning of each line.
3) All the other special characters in this PDF can be typed in a rich text file.
For some reason, none of the PDF editors I tried reproduce them accurately, for example the index letters.
Also, the Spanish characters (á, é, í, etc.) are not recognized and reproduced well.

I just want the text (but accurately), not the page appearance.

Can anyone help me please?

thanks

tinjaw

Re: extract text from pdf
« Reply #1 on: July 16, 2009, 01:48:13 PM »
It looks like you can use pdf2text with the -raw option to remove the columns.

Then use your favorite search & replace tool with regex support.

Replace '\r\n([^8])' with ' $1' and that should get you the stuff on one line.

Then remove the 8's at the beginning of the line. Replace '^8' with ''.

Then do some other regex searches to remove the garbage like

3 TaTP / Ta (an)TP

and

Ta (an)TP / Ta (an)TP 4

etc.
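
The steps above can be sketched in Python as a rough illustration. This is a line-based variant of the same idea, not the literal regexes: it assumes the extractor renders the arrow at the start of each entry as "8" (as in the patterns above), and the header/footer pattern is only a guess based on the two sample lines quoted above. `clean_text`, `MARKER` and `HEADER_RE` are names made up for this example.

```python
import re

# Assumption: the arrow that starts each entry comes out as "8" in the
# extracted text, as in the regexes above.
MARKER = "8"

# Assumption: running headers/footers look like "3 TaTP / Ta (an)TP" or
# "Ta (an)TP / Ta (an)TP 4", i.e. a " / " line with a page number at one end.
HEADER_RE = re.compile(r"^(?:\d+ .+ / .+|.+ / .+ \d+)$")


def clean_text(raw: str) -> str:
    """Join wrapped lines into one line per entry and drop headers/footers."""
    lines = raw.replace("\r\n", "\n").split("\n")
    entries = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and anything that looks like a header or footer.
        if not line or HEADER_RE.match(line):
            continue
        if line.startswith(MARKER):
            # A marker starts a new entry; drop the marker itself.
            entries.append(line[len(MARKER):].lstrip())
        elif entries:
            # No marker: this line continues the previous entry.
            entries[-1] += " " + line
        else:
            entries.append(line)
    return "\n".join(entries)
```

Headers are dropped before the lines are joined so that they do not get glued onto the end of an entry. A real line of text that happens to contain " / " and begin or end with a number would also be dropped, so the pattern likely needs tuning against the actual output.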

kalos

Re: extract text from pdf
« Reply #2 on: July 16, 2009, 03:39:04 PM »
thanks
where do I download the pdf2text?

Steven Avery

Re: extract text from pdf
« Reply #3 on: July 16, 2009, 08:27:28 PM »
Hi Folks,

Here are some PDF-to-text converters. The one tinjaw mentions above, which he indicates can accomplish what you want, is probably among them (likely the first or the last entry in this list).

Pdf2Txt ($35 - 30 day full use evaluation)
http://www.pdf2txt.com/
"PDF2TXT converts Adobe Acrobat PDF to plain text."

Zilla PDF to TXT converter  (freeware)
http://www.pdfzilla....o_txt_converter.html
"Zilla PDF to TXT Converter is a freeware to convert pdf files to plain text format files in batch mode. Zilla PDF to TXT Converter also support convert specific pages range to txt files."

XPdf (Open Source freeware)
http://www.foolabs.com/xpdf/
"Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. "

PDFTractor - $39
http://www.jimisoft....m/en/pdftractor.html
"PDFTractor can extract simple-text from PDF files"

PDF2TXT - $38
http://www.verypdf.c.../pdf2txt/pdf2txt.htm
"PDF2TXT (PDF to Text) software does extract text from PDF files"

Shalom,
Steven
« Last Edit: July 16, 2009, 10:46:57 PM by Steven Avery »

sajman99

Re: extract text from pdf
« Reply #4 on: July 16, 2009, 09:44:54 PM »
Another freeware you might try is A-PDF Text Extractor.
http://www.a-pdf.com/text : "A-PDF Text Extractor is a free utility designed to extract text from Adobe PDF files for use in other applications. There are three mode of output text: In PDF Order, Smart Rearrange and With Position... A command line version is available also to allow you to call in your program or script."

tinjaw

Re: extract text from pdf
« Reply #5 on: July 17, 2009, 08:53:40 AM »
Quote from: kalos on July 16, 2009, 03:39:04 PM
thanks
where do I download the pdf2text?

Steve had it...

XPdf (Open Source freeware)
http://www.foolabs.com/xpdf/
"Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. "

kalos

Re: extract text from pdf
« Reply #6 on: July 18, 2009, 05:03:00 PM »
None of the products you mentioned can convert special characters such as the Spanish tilde, accents, etc.

As for xpdf: it's good, but there are some issues, for example:

1) I get this character:
¤
What is this?

2) It cannot remove/ignore the headers and footers.

3) It cannot preserve the bold characters, the index characters, etc.

4) Lastly, how do I make it convert all my files (a.pdf, b.pdf, etc.) at once? I type c:\pdf\*.pdf and it doesn't do anything (do I have to convert them one by one?)

thanks
« Last Edit: July 18, 2009, 05:12:45 PM by kalos »

tinjaw

Re: extract text from pdf
« Reply #7 on: July 18, 2009, 11:26:46 PM »
Unfortunately, you aren't going to get any formatting. After all, you are converting PDF to TXT, and TXT has no formatting. The headers and footers are the things I mentioned you would have to strip out: the "3 TaTP / Ta (an)TP" and "Ta (an)TP / Ta (an)TP 4" lines. I imagine they would either be few enough to strip out by hand, or some glob or regex should handle them. As for doing them all at once, you will need to write a batch file, as pdftotext appears to take only one file as an argument. No, this isn't fully automated, but it sure beats retyping all that stuff.
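
As an alternative to a batch file, the one-file-at-a-time loop could be sketched in Python. This assumes the Xpdf text extractor (`pdftotext`) is on the PATH. `-raw` is the option mentioned earlier in the thread; `-enc UTF-8` is an addition made here, on the guess that the accented characters may be getting lost in the default output encoding. `build_command` and `convert_folder` are names made up for this example.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, exe: str = "pdftotext") -> list[str]:
    """Build the command line that converts one PDF to a .txt next to it."""
    txt = pdf.with_suffix(".txt")
    # -raw keeps the text in content-stream order (ignores the columns);
    # -enc UTF-8 is an assumption added here for the accented characters.
    return [exe, "-raw", "-enc", "UTF-8", str(pdf), str(txt)]


def convert_folder(folder: str) -> None:
    """Run the extractor once per PDF, since it takes one input file per call."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf), check=True)
```

Calling `convert_folder(r"c:\pdf")` would then write a.txt, b.txt and so on beside the PDFs. The headers and footers still have to be stripped from the resulting text files afterwards.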