Desktop tool for programmatically removing URL's from PDF files

wraith808:
From the questions he's asked before related to his company, I don't question that.  ;D  He seems to have a hell of a group of cats to herd.

Shades:
https://appligent.com/server-software/redax-enterprise-server/

Not free, but the Redax server product from the linked company looks to be up to your task.

dr_andus:
This might do it as well: PDF Document Cleanup | Debenu PDF Tools

Shades:
http://pdfedit.petricek.net/index_e.html - this one should be able to do what you ask, for free.

Other non-adobe PDF editors that might prove useful:
http://www.tracker-software.com/product/pdf-xchange-editor
http://code-industry.net/pdfeditor.php
http://www.4dots-software.com/free-pdf-metadata-editor/

questorfla:
He explained it very well, I thought.  Sometimes, people link the words http://www.google.com to http://www.malwareRus.com in PDFs.  He wants to strip out the links.
-wraith808 (August 25, 2016, 10:24 AM)
--- End quote ---

Thanks Wraith.  That is exactly the case.

Each employee gets submissions from hundreds of people every day.  These contain attachments with things like biographies, resumes, etc. all having multiple references to various material posted on the web.  There is no way anyone could check each submission for every link to be sure that each goes to where the TEXT says it does.  Due to some of the other comments here, I am going to provide the LONG version.

Unfortunately (or perhaps not), AVAST can (and does) do exactly that. On the system where all these submissions are stored, I run a constant full-time scan on all the contents of the 5 TB data drive. This scan quarantines anywhere from a couple dozen to a couple hundred files per day. It took both me and the AVAST techs some digging to understand why these PDFs were being removed at all, as they are simply flagged as pdf:UrlMal-inf [Trj] or some other malware.

It turned out that the PDF itself has no issues and is clean for use by anyone. However, after some testing we discovered that the problem was tied to the links contained within the PDF. Even this was a mystery for a while, because when we typed in those links by hand to check for problems, we found none. The sites they went to were legit and had no reason to be marked as "malware".

I then came up with the idea of removing the "ACTIVE" portions of the links, and discovered that once a PDF had these removed, the same PDF would test as clean. It occurred to me that just as you can get "spoofed" to a false link on a website, you could also get "spoofed" in a PDF if the embedded URL was NOT the same as the written URL.
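A minimal sketch of that written-versus-embedded mismatch, in Python using only the standard library. The names host_of and is_spoofed are made up for illustration. It assumes the visible text and the embedded target of each link have already been pulled out of the PDF by some other tool, which this sketch does not attempt, and it only compares hostnames, so it would not catch a bad path on the right host.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, without a leading 'www.'."""
    url = url.strip()
    # urlsplit only finds a hostname when a scheme is present, so bare text
    # such as 'www.example.com/page' is given a default scheme first.
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_spoofed(visible_text: str, target_url: str) -> bool:
    """True when the visible text looks like a URL but the link goes to another host."""
    text = visible_text.strip()
    # Text with spaces or without a dot (e.g. "click here") is not claiming
    # to be a URL, so there is nothing to compare against.
    if not text or any(ch.isspace() for ch in text) or "." not in text:
        return False
    return host_of(text) != host_of(target_url)


# The example from earlier in the thread: the words say google, the link does not.
print(is_spoofed("http://www.google.com", "http://www.malwareRus.com"))
print(is_spoofed("www.google.com", "https://google.com/search"))
```

A check like this would only flag suspect links for review; stripping the active link, as described above, is a separate step.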

Removing the "embedded" portion of the URL does not inconvenience anyone, as Adobe will recreate hyperlinks to any properly written URL anyway. I don't have time to track down why and how anyone would do this, and that is not a problem I need to deal with. I just need the PDFs to stay available once they are filed for access. To do this, they must pass through the AV scanners as clean. To do that, they must not contain any links to sites that are blacklisted as known "malware" hosts.

AVAST agrees with me that my solution would solve the problem. Eventually I am looking for a way to head this issue off before it ends up on my desk by the hundreds. An Outlook tool that would remove embedded links from all attachments would be preferable, so that each employee would end up with only clean PDFs in the first place.

At this point, though, I need a way to do it retroactively on thousands of archived documents. My first thought was to ask for a way to disable that module of the server antivirus, as it is not my place to be the "Guardian of the World" and the files themselves were safe. But I don't want to be responsible for misdirecting someone with a hidden URL, such that they click a document labelled "Library_of_Congress_ #23Z3457231kb-N" and end up on "Malware-R-Us" instead, if there is a way I could prevent it.

Since I had already found a way to fix the problem, and the fix does not have any ill effects on the documents themselves, I am just looking for a way to automate the "cleaning" and make this my small contribution to a Safer Internet.

To make this practical, it must be done by a utility that I can set to scan from A to Z and run 24/7 for as long as it takes to clean the entire 5 TB document data drive and make sure that none of those stored documents contain any “hidden” redirects. The good links will work as written anyway, as the hidden links are all that will be removed.
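A rough sketch of the "scan from A to Z and keep going" part, in Python using only the standard library. It is not a finished tool: clean_one is a placeholder for whichever program actually strips the links (none of the products mentioned in this thread is wired in here), and iter_pdfs, clean_tree and the done-log file are names invented for illustration. The log is there so a run that gets interrupted partway through the drive can pick up where it left off instead of starting over.

```python
import os
from pathlib import Path
from typing import Callable, Iterator


def iter_pdfs(root: Path) -> Iterator[Path]:
    """Yield every .pdf under root, folder by folder, alphabetically within each."""
    for folder, subdirs, files in os.walk(root):
        subdirs.sort()  # sorting in place makes os.walk descend in A-to-Z order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield Path(folder) / name


def clean_tree(root: Path, clean_one: Callable[[Path], None], done_log: Path) -> int:
    """Run clean_one on each PDF not yet listed in done_log; return how many were cleaned."""
    done = set()
    if done_log.exists():
        done = set(done_log.read_text(encoding="utf-8").splitlines())
    cleaned = 0
    with done_log.open("a", encoding="utf-8") as log:
        for pdf in iter_pdfs(root):
            key = str(pdf)
            if key in done:
                continue  # already handled in an earlier run
            try:
                clean_one(pdf)  # placeholder: the real link-stripping step goes here
            except Exception as err:
                # One bad file should not stop a run that may take days.
                print(f"skipped {pdf}: {err}")
                continue
            log.write(key + "\n")
            log.flush()  # record progress right away in case the run is interrupted
            cleaned += 1
    return cleaned
```

Something like clean_tree(Path("D:/Data"), my_cleaner, Path("done.log")) would be the call, where both the drive path and my_cleaner are stand-ins for the real setup.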
