Author Topic: Desktop tool for programmatically removing URL's from PDF files  (Read 7284 times)

questorfla

  • Supporting Member
  • Joined in 2012
  • **
  • Posts: 570
  • Fighting Slime all the Time
Is anyone aware of a tool that can be run against a system full of stored PDFs to remove the embedded URLs?
I have seen several online sites that offer this, and while they do work, there is no way I could manage to upload, scan, then download and rename thousands of files.

The problems are likely caused by 'spoofed' URLs in the first place.  The written TEXT may say www.google.com, but the embedded URL takes you to www.goggles.com or some other site.  Normally this is not a big deal, but our AV program is marking hundreds of these files every day as possibly 'malicious' and quarantining them because of links associated with malware of one type or another.  The PDF files are safe but the links are blacklisted.  And I have to assume that the embedded URLs might really be misdirecting people to genuinely malicious sites.
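To make the "spoofed link" idea concrete, here is a minimal sketch of the comparison involved: does the host named in the visible text match the host of the embedded target? It assumes the visible text is itself a URL (it says nothing useful about "click here" style links), and the function names `host_of` and `looks_spoofed` are made up for illustration. Getting the text and target out of a PDF is a separate problem not covered here.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL, tolerating a missing scheme."""
    url = url.strip()
    if "://" not in url:
        # urlsplit only fills in hostname when a scheme (or //) is present.
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    # Treat "www.example.com" and "example.com" as the same site.
    return host[4:] if host.startswith("www.") else host


def looks_spoofed(visible_text: str, embedded_url: str) -> bool:
    """True when the text shown to the reader names a different host than the link target."""
    shown, target = host_of(visible_text), host_of(embedded_url)
    return bool(shown) and bool(target) and shown != target
```

With the example from this post, `looks_spoofed("www.google.com", "http://www.goggles.com")` is true, while a link whose text and target share a host is not flagged.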

Either way, I ran a few of them through an online URL remover and the resulting documents scan as clean, which is all I need.  The written text for every URL is still in place and can still be clicked, as Adobe automatically links written URLs anyway.  At least the files can be accessed.  When quarantined, they are not available to anyone.

While I found a lot of advice on how to do this manually, one file at a time, using Acrobat, I have not found a single tool, either for sale or free, that can be run against an archive containing thousands of PDF files, which is what must be done.

If anyone has ever seen such a tool or knows where I could check, the info would be appreciated.  Preferably this would be a tool that could be run at a desktop level without having to invoke Acrobat.

 

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 5,644
Re: Desktop tool for programmatically removing URL's from PDF files
« Reply #1 on: August 25, 2016, 03:18 AM »
You could try the script I did for jity2 over here: DONE: batch print silently pdf-2-pdf using SumatraPdf and Bullzip

There may be an option in Sumatra or Bullzip to not incorporate URLs.

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
« Reply #2 on: August 25, 2016, 09:50 AM »
Sorry, I have to ask: "But why?"

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 11,190
« Reply #3 on: August 25, 2016, 10:24 AM »
He explained it very well, I thought.  Sometimes, people link the words http://www.google.com to http://www.malwareRus.com in PDFs.  He wants to strip out the links.

Deozaan

  • Charter Member
  • Joined in 2006
  • ***
  • Points: 1
  • Posts: 9,778
« Reply #4 on: August 25, 2016, 12:58 PM »
Perhaps a better question is, "Why do you have hundreds of PDFs with deceptive links to malware?"

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 11,190
« Reply #5 on: August 25, 2016, 02:31 PM »
From the questions he's asked before related to his company, I don't question that.  ;D  He seems to have a hell of a group of cats to herd.

Shades

  • Member
  • Joined in 2006
  • **
  • Posts: 2,939
« Reply #6 on: August 26, 2016, 12:22 AM »
https://appligent.com/server-software/redax-enterprise-server/

Not free, but the Redax server product from the linked company looks to be up to your task.

dr_andus

  • Supporting Member
  • Joined in 2012
  • **
  • Posts: 851
« Reply #7 on: August 26, 2016, 12:53 AM »

Shades

  • Member
  • Joined in 2006
  • **
  • Posts: 2,939
« Reply #8 on: August 26, 2016, 08:53 AM »

questorfla

  • Supporting Member
  • Joined in 2012
  • **
  • Posts: 570
  • Fighting Slime all the Time
« Reply #9 on: August 26, 2016, 10:37 AM »
Quote from wraith808: "He explained it very well, I thought.  Sometimes, people link the words http://www.google.com to http://www.malwareRus.com in PDFs.  He wants to strip out the links."

Thanks Wraith.  That is exactly the case.

Each employee gets submissions from hundreds of people every day.  These contain attachments with things like biographies, resumes, etc. all having multiple references to various material posted on the web.  There is no way anyone could check each submission for every link to be sure that each goes to where the TEXT says it does.  Due to some of the other comments here, I am going to provide the LONG version.

Unfortunately (or perhaps not), AVAST can (and does) do exactly that.  On the system where all these submissions are stored, I run a constant full-time scan on all the contents of the 5 TB data drive.  This scan quarantines anywhere from a couple dozen to a couple hundred files per day.  It took both the AVAST techs and me some digging to understand why these PDFs were being removed at all, as they are simply flagged as pdf:UrlMal-inf [Trj] or some other malware.

It turned out that the PDF itself has no issues and is clean for use by anyone.  However, after some testing we discovered that the problem was tied to the links contained within the PDF.  Even this was a mystery for a while: when we typed in those links to check for problems, we found none.  The sites they went to were legit and had no reason to be marked as "malware".

After I came up with the idea of removing the "ACTIVE" portions of the links and discovered that a PDF with these removed would then test as clean, it occurred to me that just as you can get "spoofed" to a false link on a website, you could also get "spoofed" in a PDF if the embedded URL was NOT the same as the written URL.

Removing the "embedded" portion of the URL does not inconvenience anyone, as Adobe will recreate hyperlinks to any properly written URL anyway.  I don't have time to track down why and how anyone would do this, and that is not a problem I need to deal with.  I just need the PDFs to stay available once they are filed for access.  To do this, they must pass through the AV scanners as clean.  To do that, they must not contain any links to sites that are blacklisted as known "malware" hosts.
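For anyone wondering what "removing the embedded portion" could look like at the file level, here is a rough sketch using only the Python standard library. In a PDF, a clickable link is typically stored as a link annotation whose action carries a `/URI (http://...)` entry. The sketch overwrites the target string with spaces of the same length so that byte offsets elsewhere in the file do not shift. Big assumptions: it only sees URI entries stored as uncompressed literal strings (not hex strings, compressed object streams, or encrypted files), it leaves the link rectangle in place with a blank target, and it has not been tested against any AV scanner. The name `blank_uri_targets` is hypothetical.

```python
import re

# Matches the /URI key followed by a PDF literal string, e.g. "/URI (http://example.com)".
# Backslash escapes inside the string are handled; nested unescaped parentheses are not.
URI_ENTRY = re.compile(rb"(/URI\s*\()((?:\\.|[^\\()])*)(\))", re.DOTALL)


def blank_uri_targets(pdf_bytes: bytes) -> tuple[bytes, int]:
    """Overwrite each literal-string URI target with spaces; return (new bytes, hit count)."""

    def _blank(match):
        # Same length in, same length out, so byte offsets in the file stay valid.
        return match.group(1) + b" " * len(match.group(2)) + match.group(3)

    return URI_ENTRY.subn(_blank, pdf_bytes)
```

A hit count of zero on a file that is known to contain links is a sign that the annotations are compressed or encoded in a way this sketch cannot see, and a real PDF library would be needed instead.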

AVAST agrees with me that my solution would solve the problem.  Eventually I am looking for a way to head this issue off before it ends up on my desk by the hundreds.  An Outlook tool that would remove embedded links from all attachments would be preferable, so that each employee would end up with only clean PDFs in the first place.

At this point, though, I need a way to do it retroactively on thousands of archived documents.  My first thought was to ask for a way to disable that module of the server antivirus, as it is not my place to be the "Guardian of the World" and the files themselves were safe.   But I don't want to be responsible for misdirecting someone with a hidden URL such that they click a document labelled   "Library_of_Congress_ #23Z3457231kb-N" and end up on "Malware-R-Us" instead, if there is a way I could prevent it.

Since I had already found a way to fix the problem, and the fix does not have any ill effects on the documents themselves, I am just looking for a way to automate the "cleaning" and make this my small contribution to a Safer Internet.

To make this practical, it must be done by a utility that I can set to scan from A to Z and run 24/7 for as long as it takes to clean the entire 5 TB data drive and make sure that none of the stored documents contain any "hidden" redirects.  The good links will work as written anyway, as the hidden links are all that will be removed.
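The "scan from A to Z and run for as long as it takes" part is independent of how a single file gets cleaned, so it can be sketched on its own. The following is one possible shape, assuming a caller-supplied `clean_one` function (hypothetical, it stands in for whatever tool ends up doing the actual cleaning) and a plain-text log of finished files so an interrupted run can resume where it stopped.

```python
import os
from pathlib import Path
from typing import Callable, Iterator


def iter_pdfs(root: Path) -> Iterator[Path]:
    """Yield every .pdf under root, visiting folders and files in sorted (A to Z) order."""
    for folder, subdirs, files in os.walk(root):
        subdirs.sort()  # in-place sort steers os.walk's traversal order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield Path(folder) / name


def clean_tree(root: Path, clean_one: Callable[[Path], bool], done_log: Path) -> int:
    """Run clean_one on each PDF not yet listed in done_log; return how many were processed."""
    done = set(done_log.read_text(encoding="utf-8").splitlines()) if done_log.exists() else set()
    processed = 0
    with done_log.open("a", encoding="utf-8") as log:
        for pdf in iter_pdfs(root):
            if str(pdf) in done:
                continue  # already handled in an earlier run
            if clean_one(pdf):
                log.write(f"{pdf}\n")
                log.flush()  # a crash or reboot then loses at most the file in progress
                processed += 1
    return processed
```

Files for which `clean_one` returns false are not logged, so they are retried on the next pass. Keeping the log outside the scanned drive avoids it being picked up by other tools.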

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
« Reply #10 on: August 26, 2016, 10:54 AM »
Quote from wraith808: "He explained it very well, I thought.  Sometimes, people link the words http://www.google.com to http://www.malwareRus.com in PDFs.  He wants to strip out the links."
Duh!
Thank you, and yes, he does explain it very well.
I should have read his entire post rather than just the end bit and 4wd's response.
My apologies. More haste, less speed...    :-[

questorfla

  • Supporting Member
  • Joined in 2012
  • **
  • Posts: 570
  • Fighting Slime all the Time
« Reply #11 on: August 26, 2016, 11:24 PM »
Thanks to all for the suggestions.  I am looking into both Dr. Andus's and Shades's links.  Surely one of them can do something this simple.   If not, maybe one of the online tools has a desktop version.  The one I used, PDF-du, seems to, but they sell every "piece" of their tool "à la carte".  It looks like DEBENU is a full suite and they even have a server version (though this is not for a server OS, just a system with lots of storage).  And Shades's offering comes from a company I have heard of that has been around for a long time, so of course it costs the most.
Out of all of this, I hope I can find a permanent fix.  The one-time clean-up is necessary, but the full-time fix is to clean the files when they come in.  I hope this doesn't become a task that must be done on the file server after the fact.  It would be much better as an add-in for Outlook so they get stripped clean before I have to deal with them.

Anyway, thanks for all the pointers.  I am kind of surprised that this hasn't ever happened to anyone else before.  Or maybe it has, just not to anyone who hangs out at DC.  AVAST told me that they have been getting a lot of calls from people running into this kind of thing, so perhaps the scanning of embedded URLs against some sort of blacklist for websites is a more recent addition?

And before I forget, 4wd's idea is not so far off the mark.  I just don't know how long it would take to do it that way.  I did see a few references to other people sanitizing PDFs by printing them to a virtual PDF printer program such as BullZip.   I know for sure it could be done: 4wd is Script-King, judging from everything he has ever sent me in the past.  This would be one of those "For (f) = 1 to 1 gazillion... etc etc" type runs.  Take each PDF, rename it slightly, print it to a new PDF file with the original filename, and delete the renamed copy (or save it long enough to test a few and be sure the copy is really equal to the original), then move on to the next.  Worst case, that may be my only way out.
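The rename, reprint, check, delete loop described above can be sketched for a single file as follows. The `reprint` argument is a hypothetical callable standing in for whatever actually drives the virtual printer (SumatraPDF plus Bullzip in 4wd's script); no command-line flags for those tools are assumed here. The sanity check is deliberately crude (the output exists and starts with the PDF signature) and would need to be stronger before trusting it on an archive.

```python
from pathlib import Path
from typing import Callable


def reprint_in_place(pdf: Path, reprint: Callable[[Path, Path], None]) -> bool:
    """Rename pdf aside, rebuild it under the original name, restore the original on failure."""
    backup = pdf.with_name(pdf.name + ".orig")
    pdf.rename(backup)  # step 1: rename the original slightly
    try:
        reprint(backup, pdf)  # step 2: "print" to a new file with the original filename
        ok = pdf.exists() and pdf.read_bytes().startswith(b"%PDF")
    except Exception:
        ok = False
    if ok:
        backup.unlink()  # step 3: drop the renamed copy once the new file looks sane
    else:
        pdf.unlink(missing_ok=True)  # discard any partial output
        backup.rename(pdf)  # and put the original back
    return ok
```

To keep the renamed copies around while spot-checking a few results, the `backup.unlink()` line could be replaced with a move to a holding folder.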

With my luck, all of this will turn out to be caused by "False Positives" on an ever-changing Blacklist of URL sites :(

Which is why I would feel best if ALL hidden URLs were removed.  I would prefer that any links go to exactly what people can see.  If they choose to go to them anyway, removing the ACTIVE part forces Adobe to display that warning about "Being sure you know where this link is taking you" etc.  At that point, I think I will have done as much as anyone could expect in warning people to know where they are browsing to.

 
« Last Edit: August 26, 2016, 11:54 PM by questorfla »

Shades

  • Member
  • Joined in 2006
  • **
  • Posts: 2,939
« Reply #12 on: August 27, 2016, 09:41 AM »
Do you use DNS servers from your ISP or public ones?

A public one, such as OpenDNS (primary: 208.67.222.222, secondary: 208.67.220.220), has free and commercial filter options that will help protect your users from themselves when they click on links to bad sites.

But if you are a diehard, run your own/company DNS server and have much more control over which sites your users are able to visit at any given moment.
Or if you channel internet access for all your users through a hardware or software router, find out whether it supports "blacklists". If it does, add the offending sites to that blacklist and your users are protected as well. Do keep that blacklist up to date, though. None of these pointers require you to batch edit PDF files for bad/hidden links. You might still want to take inventory of new bad links in incoming PDF files by batch or manual processing.

Some routers even let you make a custom "landing" HTML page that is served to a user who tries to visit any link in that blacklist.
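As an illustration of what a blacklist lookup amounts to, here is a small sketch of the matching rule: a host is blocked if it, or any parent domain of it, is on the list. This is only one plausible rule, not how OpenDNS or any particular router implements filtering, and the name `is_blocked` is made up.

```python
def is_blocked(host: str, blacklist: set[str]) -> bool:
    """True if host or any parent domain of it is on the blacklist (entries in lower case)."""
    labels = host.lower().rstrip(".").split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", then "com".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))
```

Matching on whole labels means "notmalwarerus.com" is not caught by an entry for "malwarerus.com", while "www.malwarerus.com" is.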

For example: I use an old AMD dual-core white-box PC with 2 GB of RAM and 3 network cards, running the OPNsense router software (FreeBSD). I have 2 different ISPs, each using one network card; the last network card connects to a big switch that provides internet to all computers hooked up to that switch (by cable or access points). A rather basic setup...but I like things simple. ;)

You manage the OPNsense router software in your browser, and it gives you a lot of control over what your users can or cannot do on either the network or the internet at any given time. Routing, NAT, firewall, DNS, DHCP, blacklists, VPN, graphical overviews of (current and historical) traffic: it does everything. The number of options it comes with might be overwhelming at first, but once it is set up as you like it...you don't want to use any other system anymore. And there are free/open source/commercial expansions available if the standard functionality isn't enough for your intents and purposes.

The OPNsense router software is a fork of the pfSense router software, which is in turn a fork of the m0n0wall software. If you are interested in playing with these software routers, there are a lot of forums and instructional videos available for free support, especially for pfSense. But if so inclined, you can also buy books and premium support for whatever exotic wish (regarding network setup) you might have.

My choice of OPNsense 16.1 over pfSense 2.3 is mainly its interface. Although pfSense version 2.3 has a drastically improved interface, I just like the OPNsense one better.