Author Topic: PdfMasher, fair priced Open Source  (Read 4737 times)

Curt

PdfMasher, fair priced Open Source
« on: September 10, 2011, 07:21 AM »
The developer at hardcoded.net releases his software under a model he calls "Fairware":
http://open.hardcoded.net/

PdfMasher, http://www.hardcoded.net/pdfmasher/, could be worth a test:


PdfMasher is a tool to convert PDF articles (newspaper, academic) to MOBI or EPUB documents. Most ebook readers support PDF files natively, but it's often a real pain to read those documents because we don't have font size control over the document like we have with native ebooks. In many cases, we have to use the zooming feature and it's just a pain.

Another drawback of PDFs on ebook readers is that annotations are not supported.

...
PdfMasher asks the user about the role of each piece of text, and does it in an efficient manner:

      Your PDF has a header on each page and you don't want them to litter your text? Sort text elements by Y-position (thus grouping them all together), shift select the elements and flag them as ignored. They will not appear on your final HTML.

      Your PDF has footnotes on many pages? Sort your elements by text content (thus grouping all elements with the text starting with a number together) and flag them as footnotes. They will be moved to the end of the document, and PdfMasher will try to create hyperlinks to footnote references.
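The two workflows quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not PdfMasher's actual code: the `TextElement` class, its fields, the state names, and the helper functions are all hypothetical, and the sketch assumes PDF-style coordinates where a larger Y means higher up on the page.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """One piece of text extracted from a PDF page (hypothetical structure)."""
    page: int
    y: float                 # vertical position; larger = higher on the page (assumption)
    text: str
    state: str = "normal"    # "normal", "ignored", or "footnote"


def flag_by_y(elements, y_min, y_max, state="ignored"):
    """Flag every element whose Y-position lies in [y_min, y_max]."""
    # Sorting by Y mirrors the GUI step: repeated page headers end up
    # next to each other, so one range covers all of them.
    for el in sorted(elements, key=lambda e: e.y):
        if y_min <= el.y <= y_max:
            el.state = state


def flag_footnotes(elements):
    """Flag still-normal elements whose text starts with a digit as footnotes."""
    # Sorting by text groups the digit-led elements together, as in the GUI.
    for el in sorted(elements, key=lambda e: e.text):
        if el.state == "normal" and el.text[:1].isdigit():
            el.state = "footnote"


def build_output(elements):
    """Return body text in reading order, with footnotes moved to the end."""
    ordered = sorted(elements, key=lambda e: (e.page, -e.y))
    body = [e.text for e in ordered if e.state == "normal"]
    notes = [e.text for e in ordered if e.state == "footnote"]
    return body + notes


elements = [
    TextElement(1, 780, "Journal header"),
    TextElement(1, 700, "First paragraph."),
    TextElement(1, 40, "1 A footnote."),
    TextElement(2, 780, "Journal header"),
    TextElement(2, 700, "Second paragraph."),
]
flag_by_y(elements, 770, 800)   # ignore the header band at the top of each page
flag_footnotes(elements)
result = build_output(elements)
print(result)
```

The real program lets the user pick the elements interactively after sorting; the fixed Y range and the "starts with a digit" rule here only stand in for that manual selection.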

...
PdfMasher does NOT preserve style and images. PDF is evil and just the task of extracting text from it while preserving the flow is daunting. I receive e-mails from people disappointed that their styling is lost in the process. Sorry, PdfMasher's focus is not on style preservation (hence the "masher" part of the name) and if it's something you need, PdfMasher is probably not for you.

Platforms:
Mac OS X: 10.6 and up (Snow Leopard or Lion).
Linux: Ubuntu 11.04 (32-bit and 64-bit)
Windows XP/Vista/Win7

Although it's quite capable already, PdfMasher is still in early development.
Current Version: 0.6.2
Price: Fair


The installer also installs the Microsoft Visual C++ 2008 Redistributable.

The two videos are for older versions.

---
TEST-page: http://www.monde-diplomatique.fr/


PdfMasher's OCR must be good; almost all of the produced text was correct. The author is probably not a trained teacher, though: the GUI and the procedures could be laid out quite a bit better. For example, to remove some text from the article you have to select the text block in question, move the mouse to the other side of the window and click 'ignore', instead of merely pressing Delete! This is far too slow a procedure, and it made me tired of the program early on.

Still, because I took the time to watch the videos and read about it all, it was easy to use the program and create an e-book, and the resulting file is very useful as an e-book. The nag screen in the produced file looks like a turn-off for freeware hunters.

[Screenshots: table view; page with text; page in reorder mode]

Produced file:

[Screenshot: nag screen]



Parts of this program are excellent. But as the author said: "Although it's quite capable already, PdfMasher is still in early development." Program removed.
« Last Edit: September 10, 2011, 07:39 AM by Curt »

hsoft

Re: PdfMasher, fair priced Open Source
« Reply #1 on: September 10, 2011, 08:17 AM »
Heh, it's funny that this thread about PdfMasher shows up at roughly the same time as this other one about fairware.

I'm the author. It's a fair review, thanks, but there are two things I'd like to point out.

1. The keybinding to set selected items as ignored is "i", you don't have to click the button each time. I guess that a "del" keybinding would make sense as well.

2. PdfMasher doesn't do OCR, it only deals with text information within the PDF. If your PDF is "image only" (if you can't copy/paste text in your pdf reader), PdfMasher can't process it.