What is the best method for preventing Googlebot (et al.) from indexing a website?

questorfla:
Robots.txt always seemed to do the job, but we have recently found that it no longer works as well as it used to.  Various documents and other items were recently found in a Google web cache that, in theory, should not have been there.  I just wondered if there is any better way to prevent having every file in every site posted somewhere in a web cache.  Is robots.txt still the only and best thing to use for an Apache website?  If possible, I would like Google to forget the whole domain exists, as it contains private files that are in a preliminary stage.  There is no need for anyone to ever be able to find them through a web search.
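
For reference, a disallow-all robots.txt is about as strict as that mechanism gets, and it is still only advisory:

    User-agent: *
    Disallow: /

As far as I know, robots.txt only asks crawlers not to fetch pages; it does not tell Google to drop or stop caching URLs it already knows about, which may be why the documents are still turning up in the web cache.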

4wd:
Isn't .htaccess the best way to keep crawlers out of a website?

An example of how is on AskApache: https://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess/

You could add Google to that list (and any other search engine).
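
On Apache 2.4, something along these lines should do it (a minimal sketch, assuming mod_setenvif and mod_authz_core are loaded; the bot names are only examples):

    BrowserMatchNoCase "Googlebot" search_bot
    BrowserMatchNoCase "bingbot" search_bot
    <RequireAll>
        Require all granted
        Require not env search_bot
    </RequireAll>

A softer alternative is to leave the files reachable but send a noindex header via mod_headers (Header set X-Robots-Tag "noindex, noarchive"), though Google has to be able to fetch a page to see that header.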

questorfla:
Asking Apache is going to be the next step for sure. 
I need to find out which software doesn't work with Apache's .htaccess password control, as that is how this problem came up in the first place.

A couple of sites that we ran for this one client would not work for them because their systems blocked any sites we made that used .htaccess to ask for a password.  We have made other sites for the same people using our old client-site software, which also had login/password requirements, but it used a MySQL database that controlled all aspects of the site contents.  They had no problem using that setup.

But when they tried to access a site guarded by the Apache .htaccess setup, their systems threw up all kinds of security warnings unless we removed the .htaccess login requirements.  This only occurs with that single client, so it must be some security arrangement on their systems that doesn't want to work with Apache's .htaccess.

However, removing that leaves the sites open to anyone, and apparently the robots.txt that worked fine for all these years (and still does work if the sites are password protected) isn't enough to prevent Googlebot from indexing their contents.

The bottom line is that, as things stand, we can't password protect that client's sites anyway, so finding out why their systems won't accept .htaccess is the biggest part of the problem.
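
One common reason for those warnings is that .htaccess Basic Auth over plain HTTP sends the password essentially in clear text (only base64-encoded), which a lot of corporate security software flags; the same setup served over HTTPS usually passes.  For what it's worth, the directives involved are just these few lines (a sketch; the AuthUserFile path is a placeholder):

    AuthType Basic
    AuthName "Restricted"
    AuthUserFile /path/to/.htpasswd
    Require valid-user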

Shades:
An idea:
Googlebot reaches your files through the Apache web server.  The web server runs on the server under a specific user account, and the NTFS file system has the option to deny access to files.  You could set 'deny access' for Apache's user account on the files you don't want indexed.  This should not affect normal access to those files through file shares.  Of course, if the maintainers of these files try to access them through the web server, it would fail for them as well.

Kinda brutal, but inaccessible files are very hard to index, even with whatever current and future trickery Google builds into Googlebot.

If you run Apache on a Linux server, file access management is usually easier.  Still, a maintainer trying to access the files through the company web server will be a problem on Linux as well.

If Google doesn't index your sites, that will hurt their SEO score and therefore their placement in Google's search results.  If that is not an issue or concern, you could consider the idea above.
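
On a Windows/NTFS server, the deny rule can be set from an elevated command prompt with icacls (a sketch; the folder path and the "ApacheSvc" account name are placeholders for whatever account the Apache service actually runs under):

    icacls "C:\sites\client\private" /deny "ApacheSvc:(OI)(CI)R"

Here (OI)(CI) makes the deny inherit to files and subfolders, and R denies read access.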

flamerz:
Save all as images  8)
