Author Topic: Pdf Management (Read 5030 times)

clean · « **on:** February 06, 2013, 03:25 PM »

In this thread,

http://www.kinook.co...hp?p=20178#post20178

Kyle, the man behind kinook, says, in answer to a complaint by a user that Ultra Recall doesn't find text in pdf's that are not imported into UR, but just referenced by UR (which is the much smarter way imo, cf. my "Passion" thread here),

"Note that some PDF files aren't parseable for text content. One PDF text parser vendor indicated, "Some PDFs will simply never parse the way you would expect them to for various reasons. There is NO PDF to text converter in the world that can work with every PDF file ever created. Even Adobe itself cannot convert all PDFs to text properly." The PDF parser we use works with most files we have tested, but I believe that if the text in the PDF file is encrypted or stored in a non-standard format, most tools can't parse text from them."

This brings the idea to me of the respective reliability of those pim's or other pdf managers (e.g. of the "university kind") that index referenced pdf files since it's clear as day that people who store many such papers want this to be done automatically and without then having to wonder about the quality of the built-up index, i.e. if you store pdf's, thinking you'll be able to search them afterwards, you obviously rely upon this manager building up the index properly - if afterwards, it will not find but some terms, in an aleatoric way, whilst it won't find others, but not even indicating to you that many terms could be there that it has not been able to index (= and to search now), you might be in deep trouble:

In ancient times, we had to read books and journals in order to scan for possible "hits"; if you just store pdf papers now, relying on the search feature of your pdf manager to produce these same hits, and if this manager doesn't find but some of them, you'll end up discarding papers that could have been central to your subject or "overlook" important parts in them.

Hence my questions:

- What are such pdf managers (except for the obvious ones, i.e. pdf "editors" from Adobe and its competitors), and which index referenced pdf's? (I know about UR, then TheBrain, and not many more.)

- What about the "pdf quality" of those standard search progs, e.g. Copernic, X1, dtSearch, etc.?

- Have you got some experience with these reliability questions, with what sw?

- Is there sw that will check the global file size or such of an indexed pdf, and inform you of possible discrepancies between this overall size and the possible sparcity of terms it found in it, in order to be indexed? I.e. are there progs that at least "warn you" when doing the indexing? (I mean when they encounter probs or when they assume there are probs?)

- If not, do these progs at least warn you, on indexing, when they can't "read" the file to index? (I mean when a file is "secured" or such and cannot be indexed at least, vs. problematic parts "only", in the first alternative.)

- Of course, Kyle from kinook in his cited answer tries to reduce the problem to such pdf files that "cannot" be read, but then, in other respects, Kyle is not into expensive components for his prog, so his pdf parser is probably not the very best on the market either (a similar prob in UR: the quality / lacking speed of its html storage, cf. the "specialists", Surfulater and especially WebResearch), hence my idea that there will certainly be big differences in the quality / completeness (or absence of it) of this indexing pdf's.

For the reasons cited above - today, you often rely upon technology to "read" for you, so you should better know if the technology you rely upon is trustworthy or not - this - rather overlooked - subject seems to be of high importance. Any insights or sources?

At the end of the day, it could come down to Adobe Acrobat and / or dtSearch, i.e. the most expensive, specialised offerings, but perhaps we get valuable info on more practical offerings - some pim (and its in-built pdf parser) could be as good in indexing pdf's as could be Acrobat... all the more so since many pdf's are not created with Acrobat, so the Adobe solution might not necessarily be the best of them all, for "reading" / indexing, and it's certainly not a very practical one -

Pdf editing is rather advanced now, and there are many low or medium priced offerings. But reliable pdf M seems to be a thing needing further discussion, especially in the light of the possible harms of

a) absence of indexing, and
b) partial indexing only,
when in neither case the user is informed of the lacking index entries.

EDIT: I'd wish to add that many pdf's are compounds of various sources, i.e. different parts in it might have been created by very different means. So within the same pdf, different "cases" of processing needs might apply, and I hope a "good" parser will properly evaluate this and react accordingly, whilst a cheap parser probably will just skip the "difficult" parts, and worse, without telling you it just skipped them.

clean · « **Reply #1 on:** February 08, 2013, 05:28 PM »

Citation from

https://www.donation...x.php?topic=2434.725 :

Darwin
Re: What is the currently best Desktop Search software?
« Reply #737 on: October 15, 2010, 09:12:01 AM » Quote
"Hmph! I downloaded the pdf and saved it in a folder that X1 is supposed to be watching and fully indexing. I then set X1 to run whether the computer is in use or not and did a manual indexing... three times. As far as I can tell, not only doesn't X1 index the complete file, it's not indexing this one AT ALL!"

Well, that's what I'm speaking about here.

And of course, my post above wasn't precise enough, in sw like Acrobat you "open" a pdf, THEN "search" it, one hit by one, which is useless for pdf M, of course, even if the search capability is ok. I mean, if you have to open 500 files per hand, in order to search them one by one then, even the fact that you don't overlook something will not counterweigh the time you spend with such almost "visual checking everything".

There's also Lookeen, the former Outlook specialist that now doesn't cost 40 bucks anymore, but 60 euro, about 80 bucks, but that searches MS files, and pdf's (also independently from mail).

The thread above has got 31 pages, the search term "search" in DC brings 200 pages (!) of hits, and since it's exactly 200, I assume there are more but which are clipped.

On the other hand, "search pdf" here brings 1 page of hits that won't make you travel far.

I think the pdf format is by far the most problematic format for any search tool, since it's not "stable" / "regular" in the sense of "either the tool can read this format, or it cannot", so here the risk of a search hiding terms that should be hits, is more prominent than with any other file format.

Thus, I'd be quite interested in knowing of experiences with searching pdf's. X1 is visually "best", or at least very pleasant, so it'd be a shame if its pdf search capability wasn't that good, and then, the price of dtsearch (without knowing if it's really better for pdf's) is 5 times higher. (And yes, I think Copernic is ugly and not user-friendly, trialled it several times and was put off by its seeming inability to let you choose what will be indexed - I abhor sw that takes over my system!

Btw, in the other thread, people complain about X1 permanently indexing. I suppose that if you only make it index some folders, with "real material", any good search sw will index any new entries into these folders, and then stop indexing, i.e. you import another big pdf, it'll index it, then another one, but neither X1 nor other good search sw will need to re-build the index again and again, on such occasions, so at the end of the day, it's all about your choosing those folders to be indexed, and then an indexing search tool will behave well and not at all wear out your hdd!

After all, it all comes down to COMPLETE indexing upon which you could rely then, and to have some decent "hit table".

If not even X1 produces reliable results (IF I said, I don't know anything about it) for pdf's, how could you imagine that UR could produce them? And my point is, instead of "relying" on such in-built, "handy" but unreliable pdf search "functionality", you'd better have parallel systems, with a reliable pdf search tool to manage your pdf's.

clean · « **Reply #2 on:** February 09, 2013, 09:04 AM »

Interesting thread here (more than 4 pages I'm afraid):

http://answers.micro...-3e01349dccca?page=5

Even Adobe products can search several pdf's in a row, but it doesn't seem to be fast.

Have a look here:

http://www.foxitsoft...lter/performance.php

Foxit pdf iFilter is 20 bucks per seat; if it works, that should be very reasonable.

When having crawled many more "search pdf index" sites, I'll do perhaps some testing, just indexing my current pdf's of all sorts with multiple progs and compare the results (and no, there does NOT seem to be any such prog that will tell you it can't properly index a pdf...). Prob here, in order to not affect the indexing with one prog by the previous installation of a competing prog, I'd have to to reset my comp to a previous state between every trialling (takes my 30 minutes each time). But from the above thread you'll have understood that the pdf format is the worst file format you can get. All the more so it'd be helpful to know which indexing search tools (and in which global circumstances of your system) will deliver reliable results. I would have expected many more insights into this format from the "academic sw" side (where reliability of pdf searches would be crucial), but no...

Author Topic: Pdf Management (Read 5030 times)

clean

Pdf Management

clean

Re: Pdf Management

clean

Re: Pdf Management