Author Topic: What is the best format for preventing googlebot (et al) from indexing a website  (Read 405 times)

questorfla

  • Supporting Member
  • Joined in 2012
  • Posts: 564
  • Fighting Slime all the Time
Robots.txt always seemed to do the job, but we have recently found that it no longer works as well as it used to. Various documents and other items recently turned up in a Google web cache that, in theory, should not have been there. Is there a better way to keep every file on every site from ending up in a web cache? Is robots.txt still the only and best option for an Apache website? If possible, I would like Google to forget the whole domain exists, as it contains private files that are at a preliminary stage. There is no need for anyone to ever be able to find them through a web search.
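For reference, a minimal sketch of what a "keep out of everything" robots.txt asks for, checked with Python's standard `urllib.robotparser`. The `example.com` URLs and the `is_allowed` helper are placeholders for illustration, not anything from the poster's setup. Keep in mind that robots.txt is only a request that well-behaved crawlers choose to honor; it does not stop anyone from fetching the files, and a disallowed URL can still be listed if other pages link to it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and the path are placeholders.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/draft.pdf"))
```

A check like this only confirms that the file says what you intended; it says nothing about whether a given crawler actually obeys it.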

4wd

  • Supporting Member
  • Joined in 2006
  • Posts: 5,339
Isn't .htaccess the best way to limit website crawling?

An example of how is on AskApache: https://www.askapach...apers-with-htaccess/

You could add Google to that list (and any other search engine).
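The link above is truncated, so the following is not the article's rule set. It is only a rough sketch, in Python rather than Apache configuration, of the decision such a block list encodes: match the request's User-Agent against a list of crawler names and refuse the ones that match. The token list and the `status_for` helper are hypothetical.

```python
# Hypothetical block list; these tokens are illustrative and are not
# taken from the AskApache article.
BLOCKED_AGENT_TOKENS = ("googlebot", "bingbot")


def status_for(user_agent: str) -> int:
    """Return 403 for a blocked crawler, 200 for everyone else."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
    print(status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A User-Agent string is supplied by the client and can be faked, so this kind of filter only keeps out crawlers that identify themselves honestly.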

questorfla

Asking Apache is going to be the next step for sure. 
I need to know what software doesn't work with Apache's .htaccess password control as that is why the problem came up to start with.   

A couple of sites that we ran for this one client would not work for them because their systems blocked any site we made that used .htaccess to ask for a password. We have made other sites for the same people using our old client-site software, which also had login/password requirements, but it used a MySQL database that controlled all aspects of the site contents. They had no problem using that setup.

But when they tried to access a site guarded by the Apache .htaccess setup, their systems threw up all kinds of security warnings unless we removed the .htaccess login requirements. This only occurs with that single client, so it must be some security arrangement on their systems that doesn't want to work with Apache .htaccess.

However, removing that leaves the sites open to anyone, and apparently robots.txt, which worked fine for all these years (and still does work if the sites are password protected), isn't enough to prevent Googlebot from indexing the contents.

Bottom line: as things stand, we can't password protect that client's sites anyway, so finding out why they won't accept .htaccess is the biggest part of the problem.

Shades

  • Member
  • Joined in 2006
  • Posts: 2,677
An idea:
Googlebot fetches your files through the Apache web server. That web server runs under a specific user account on the server. The NTFS file system lets you deny access to files per account, so you could set the 'deny access' option for Apache's user account on the files you don't want indexed. This should not affect normal access to these files through file shares. Of course, if the maintainers of these files try to access them through the web server, it would fail for them as well.

Kinda brutal, but inaccessible files are very hard to index, even for the current and future trickery Google implements in their Google-bot.

If you run Apache on a Linux server, file access management is usually easier. Still, the maintainer trying to access files through the company web server will also be a problem with Linux.
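On Linux, the same idea can be sketched with ordinary file permissions. The example below assumes the files are owned by an account other than the one Apache runs as, and that Apache is not running as root; neither is stated in the thread. The `hide_from_other_users` helper is a hypothetical name. It strips the group and other permission bits so that only the owning account can still read the file.

```python
import os
import stat


def hide_from_other_users(path: str) -> int:
    """Strip group/other permissions from path; return the new mode bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    # Keep only the owner's read/write/execute bits.
    owner_only = current & stat.S_IRWXU
    os.chmod(path, owner_only)
    return owner_only


if __name__ == "__main__":
    import tempfile

    # Demonstrate on a throwaway file rather than a real document.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        demo_path = tmp.name
    os.chmod(demo_path, 0o644)
    print(oct(hide_from_other_users(demo_path)))
    os.remove(demo_path)
```

As with the NTFS variant, this blocks every request that goes through the web server, including requests from the files' own maintainers.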

If Google doesn't index your sites, that will hurt their SEO score and therefore their placement in Google search results. If that is not a concern, you could consider the idea above.

flamerz

  • Supporting Member
  • Joined in 2011
  • Posts: 140
Save all as images  8)

4wd

Quote from: questorfla
"Asking Apache is going to be the next step for sure.
I need to know what software doesn't work with Apache's .htaccess password control as that is why the problem came up to start with.

A couple of sites that we ran for this one client would not work for them because their systems blocked any sites we made that used .htaccess to ask for a password."

With the approach I linked, you don't use .htaccess for login/password control; you use it to filter out domains that shouldn't have access to the site.

Filtered domains get an error 403 (Forbidden) document instead of the page they asked for.