He explained it very well, I thought. Sometimes, people link the words http://www.google.com to http://www.malwareRus.com in PDFs. He wants to strip out the links.
Thanks Wraith. That is exactly the case.
Each employee gets submissions from hundreds of people every day. These contain attachments such as biographies and resumes, all with multiple references to various material posted on the web. There is no way anyone could check every link in each submission to be sure that it goes where the TEXT says it does. Due to some of the other comments here, I am going to provide the LONG version.
Unfortunately (or perhaps not), AVAST can and does do exactly that. On the system where all these submissions are stored, I run a constant full-time scan of the entire contents of the 5 TB Data drive. This scan quarantines anywhere from a couple dozen to a couple hundred files per day. It took the AVAST techs and me some digging to understand why these PDFs were being removed at all, as they are simply flagged as pdf:UrlMal-inf [Trj] or some other malware.
It turned out that the PDF itself has no issues and is clean for use by anyone. However, after some testing we discovered that the problem was tied to the links contained within the PDF. Even this was a mystery for a while: when we typed in those links by hand to check for problems, we found none. The sites they went to were legit and had no reason to be marked as "malware".
After I came up with the idea of removing the "ACTIVE" portions of the links, I discovered that once a PDF had these removed, the same PDF would then test as clean. It occurred to me that just as you can get "spoofed" to a false link on a website, you could also get "spoofed" in a PDF if the embedded URL was NOT the same as the written URL.
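To make the "embedded URL is not the written URL" idea concrete, here is a minimal Python sketch of the comparison. It assumes the visible text and the embedded target have already been pulled out of the PDF by some other tool, and that the visible text is itself a URL. The function names `host_of` and `is_spoofed` are made up for illustration; this is not how AVAST decides anything.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host name of a URL, without a leading 'www.'."""
    # urlparse only recognises the host when a scheme or '//' is present,
    # so add '//' for bare text such as 'www.example.com/page'.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_spoofed(visible_text, embedded_url):
    """True when the written URL and the embedded URL point at different hosts.

    Assumes visible_text is a URL; descriptive link text is out of scope.
    """
    shown = host_of(visible_text.strip())
    target = host_of(embedded_url.strip())
    return bool(shown) and bool(target) and shown != target

if __name__ == "__main__":
    print(is_spoofed("http://www.google.com", "http://www.malwareRus.com"))
    print(is_spoofed("www.google.com/page", "http://google.com/page"))
```

With the example from the top of the thread, the first call reports a mismatch and the second does not, since only the host names are compared.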
Removing the "embedded" portion of the URL does not inconvenience anyone, as Adobe will recreate hyperlinks from any properly written URL anyway. I don't have time to track down why and how anyone would do this, and that is not a problem I need to deal with. I just need the PDFs to stay available once they are filed for access. To do this, they must pass through the AV scanners as clean. To do that, they must not contain any links to sites that are blacklisted as known "malware" hosts.
AVAST agrees with me that my solution would solve the problem. Eventually, I am looking for a way to head this issue off before it ends up on my desk by the hundreds. An Outlook tool that would remove embedded links from all attachments would be preferable, so that each employee would end up with only clean PDFs in the first place.
At this point, though, I need a way to do it retroactively on thousands of archived documents. My first thought was to ask for a way to disable that module of the Server Antivirus, as it is not my place to be the "Guardian of the World" and the files themselves were safe. But I don't want to be responsible for misdirecting someone with a hidden URL, such that they click a link labelled "Library_of_Congress_ #23Z3457231kb-N" and end up on "Malware-R-Us" instead, if there is a way I could prevent it.
Since I had already found a way to fix the problem, and the fix does not have any ill effects on the documents themselves, I am just looking for a way to automate the "cleaning" and make this my small contribution to a Safer Internet.
To make this practical, it must be done by a utility that I can set to scan from A to Z and run 24/7 for as long as it takes to clean the entire 5 TB document Data drive and make sure that none of those stored documents contain any “hidden” redirects. The good links will work as written anyway, as the hidden links are all that will be removed.
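For what it's worth, the "scan from A to Z and keep going" part could be sketched in Python roughly as below. This is only the batch driver, under some assumptions: the actual link stripping is passed in as a function (called `clean_one` here, a hypothetical name) that would be written with whatever PDF library or tool ends up doing the job, and finished files are recorded in a plain text log (`done_log`, also hypothetical) so a run that gets interrupted can pick up where it left off.

```python
import os

def find_pdfs(root):
    """Yield every PDF path under root, folders and files in alphabetical order."""
    for folder, subdirs, files in os.walk(root):
        subdirs.sort()  # sorting in place makes os.walk descend alphabetically
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(folder, name)

def clean_tree(root, clean_one, done_log):
    """Run clean_one(path) on each PDF under root, skipping paths listed in done_log.

    Returns the number of files cleaned during this run.
    """
    done = set()
    if os.path.exists(done_log):
        with open(done_log, encoding="utf-8") as fh:
            done = {line.rstrip("\n") for line in fh}
    cleaned = 0
    with open(done_log, "a", encoding="utf-8") as log:
        for path in find_pdfs(root):
            if path in done:
                continue
            try:
                clean_one(path)  # placeholder for the real link-stripping step
            except Exception as err:
                # One unreadable file should not stop a multi-day run.
                print(f"skipped {path}: {err}")
                continue
            log.write(path + "\n")
            log.flush()  # keep the log current in case the run is interrupted
            cleaned += 1
    return cleaned
```

Files that fail are not logged, so they get retried on the next pass. On a drive this size it would be wise to try it on a copy of a small folder first.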