In this thread,
http://www.kinook.co...hp?p=20178#post20178
Kyle, the man behind Kinook, answers a user's complaint that Ultra Recall doesn't find text in PDFs that are not imported into UR but only referenced by it (which is the much smarter way imo, cf. my "Passion" thread here):
"Note that some PDF files aren't parseable for text content. One PDF text parser vendor indicated, "Some PDFs will simply never parse the way you would expect them to for various reasons. There is NO PDF to text converter in the world that can work with every PDF file ever created. Even Adobe itself cannot convert all PDFs to text properly." The PDF parser we use works with most files we have tested, but I believe that if the text in the PDF file is encrypted or stored in a non-standard format, most tools can't parse text from them."
This raises the question of the respective reliability of those PIMs and other PDF managers (e.g. of the "university" kind) that index referenced PDF files, since it's clear as day that people who store many such papers want this done automatically, without having to wonder about the quality of the built-up index. If you store PDFs thinking you'll be able to search them afterwards, you obviously rely on the manager building up that index properly. If it then finds only some terms, more or less at random, misses others, and doesn't even indicate that many terms might be present which it was unable to index (and thus to search now), you could be in deep trouble:
In ancient times, we had to read books and journals ourselves in order to scan for possible "hits"; if you just store PDF papers now, relying on the search feature of your PDF manager to produce those same hits, and this manager finds only some of them, you'll end up discarding papers that could have been central to your subject, or "overlooking" important parts within them.
Hence my questions:
- What PDF managers are there (apart from the obvious ones, i.e. the PDF "editors" from Adobe and its competitors), and which of them index referenced PDFs? (I know about UR, then TheBrain, and not many more.)
- What about the PDF-parsing quality of the standard search programs, e.g. Copernic, X1, dtSearch, etc.?
- Do you have any experience with these reliability questions, and with which software?
- Is there software that will check the overall file size (or similar) of an indexed PDF and inform you of possible discrepancies between that size and the sparsity of terms it was actually able to extract for indexing? I.e. are there programs that at least "warn you" during indexing when they encounter problems, or when they suspect there are problems? (See the sketch after this list.)
- If not, do these programs at least warn you, on indexing, when they can't "read" a file at all? (I mean when a file is "secured" or similar and cannot be indexed in the first place, as opposed to the first alternative, where "only" some parts are problematic.)
- Of course, Kyle from Kinook, in his cited answer, tries to reduce the problem to those PDF files that "cannot" be read; but then, in other respects, Kyle is not one for expensive components in his program, so his PDF parser is probably not the very best on the market either (a similar problem in UR: the quality / lacking speed of its HTML storage, cf. the "specialists", Surfulater and especially WebResearch). Hence my idea that there will certainly be big differences between tools in the quality / completeness (or absence thereof) of their PDF indexing.
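To make the file-size question concrete, here is a minimal sketch of the kind of check I have in mind, written in Python against the pypdf library (an assumption on my part; any PDF library with text extraction would do). The 1% text-to-file-size threshold is purely illustrative, not a figure taken from any real indexer; the point is only that a "warn me when extraction looks suspiciously sparse" pass is cheap to run alongside indexing.

```python
import os
import sys
from pypdf import PdfReader

# Hypothetical threshold: roughly 1 byte of extracted text per 100 bytes of file.
MIN_TEXT_TO_SIZE_RATIO = 0.01

def check_pdf_indexability(path: str) -> str:
    """Classify a PDF as unreadable, encrypted, suspiciously sparse, or OK."""
    size = os.path.getsize(path)
    try:
        reader = PdfReader(path)
    except Exception as exc:
        return f"UNREADABLE: {path} ({exc})"
    if reader.is_encrypted:
        return f"ENCRYPTED (cannot be indexed): {path}"
    # Concatenate whatever text the parser can extract from every page.
    text = "".join((page.extract_text() or "") for page in reader.pages)
    ratio = len(text.encode("utf-8")) / max(size, 1)
    if ratio < MIN_TEXT_TO_SIZE_RATIO:
        return f"SUSPICIOUSLY SPARSE: {path} ({len(text)} chars extracted from {size} bytes)"
    return f"OK: {path}"

if __name__ == "__main__":
    for pdf in sys.argv[1:]:
        print(check_pdf_indexability(pdf))
```

Run over a folder of referenced PDFs, something like this would at least tell you which files your PIM is probably searching "blind" on, instead of leaving you to discover it months later.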
For the reasons cited above (today you often rely upon technology to "read" for you, so you had better know whether the technology you rely upon is trustworthy or not), this rather overlooked subject seems to be of high importance. Any insights or sources?
At the end of the day, it could come down to Adobe Acrobat and / or dtSearch, i.e. the most expensive, specialised offerings, but perhaps we'll get valuable info on more practical ones: some PIM (and its built-in PDF parser) could be as good at indexing PDFs as Acrobat is. All the more so since many PDFs are not created with Acrobat, so the Adobe solution is not necessarily the best of them all for "reading" / indexing, and it's certainly not a very practical one.
PDF editing is rather advanced now, and there are many low- or medium-priced offerings. But reliable PDF management seems to be a thing needing further discussion, especially in the light of the possible harms of
a) absence of indexing, and
b) partial indexing only,
when in neither case the user is informed of the missing index entries.
EDIT: I'd like to add that many PDFs are compounds of various sources, i.e. different parts within the same file may have been created by very different means. So within the same PDF, different processing needs might apply, and I hope a "good" parser will properly evaluate this and react accordingly, whilst a cheap parser will probably just skip the "difficult" parts, and, worse, do so without telling you it skipped them.
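Along the same lines, a per-page variant of the check above would address exactly this compound-PDF case: flag the individual pages that yield little or no text (typically scanned-image pages sitting among born-digital ones) instead of silently skipping them. Again this is only a sketch assuming pypdf, and the 20-character cutoff per page is an illustrative guess, not any standard:

```python
from pypdf import PdfReader

def report_sparse_pages(path: str, min_chars_per_page: int = 20) -> list[int]:
    """Return the 1-based page numbers that yield almost no extractable text."""
    reader = PdfReader(path)
    sparse_pages = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if len(text.strip()) < min_chars_per_page:
            sparse_pages.append(number)
    return sparse_pages

# Example: the pages listed here are exactly the ones a "good" indexer
# should warn you about instead of silently skipping.
# print(report_sparse_pages("some_compound_paper.pdf"))
```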