How to block bad spiders from wasting bandwidth

This article was posted in: Web Hosting

The problem

Search engines use technology known as spiders to search the web (nice, huh?). A spider is an agent (also called a bot, short for robot) that connects to your website and downloads a copy of all of your pages (or tries to) in order to populate the search engine it is working for. However, spiders are supposed to obey certain rules, and they are definitely not supposed to thrash your website to the point that it causes a denial of service or uses up all of your bandwidth.

By adding special instructions to a file called .htaccess (the full stop in front of it is intentional) you can instruct your web server to deny requests from specific spiders.

Solution 1 - ban by IP address

If a file does not already exist at public_html/.htaccess you can create an empty one.

Add this to the top of the file, replacing x.x.x.x with the IP address of the bad spider bot.

order allow,deny
allow from all
deny from x.x.x.x

Very often bots use a range of IP addresses. For example Baiduspider, a Chinese spider which causes many of our customers to experience problems, appears to use a range of addresses from to to spider sites in the UK. In order to completely block this range, you can add:

order allow,deny
allow from all
deny from
deny from
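
When a spider uses a contiguous range of addresses, the range can be written as one or more CIDR blocks, which the deny directive accepts in place of a single address. The following is a minimal Python sketch of that conversion, assuming Python 3 is available on your own machine. The function name deny_lines is hypothetical, and the addresses shown are reserved documentation addresses, not the addresses of any real spider.

```python
import ipaddress
from typing import List


def deny_lines(first_ip: str, last_ip: str) -> List[str]:
    """Return 'deny from' lines covering every address from first_ip to last_ip."""
    first = ipaddress.ip_address(first_ip)
    last = ipaddress.ip_address(last_ip)
    # summarize_address_range yields the CIDR networks that together
    # cover the inclusive range first..last exactly.
    networks = ipaddress.summarize_address_range(first, last)
    return [f"deny from {network}" for network in networks]


if __name__ == "__main__":
    # Placeholder range taken from the 192.0.2.0/24 documentation block.
    # Replace it with the range you actually see in your logs.
    for line in deny_lines("192.0.2.0", "192.0.2.255"):
        print(line)
```

The printed lines can be pasted beneath the allow from all line in .htaccess. A range that does not fall on a neat boundary simply produces several deny lines.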

If you only want to apply these rules to a particular directory path within your website, then you can add

order allow,deny
allow from all
deny from

This would block from being able to access

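You can read more about these directives in the Apache 2.2 documentation for mod_authz_host (the module that provides Order, Allow and Deny in Apache 2.2; it was called mod_access in earlier versions).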

Solution 2 - ban by User Agent

If you know how the spider is identifying itself, then you can block requests on the basis of the User-Agent HTTP request header.

So, how do you find out what User-Agent is hitting your site so hard? Look in your raw Apache logs, which you will find in the Logs section of cPanel.

There you can download the logs that have been collected so far today by clicking on the domain in question. Once you have downloaded and uncompressed the .gz file, you will have to load the file up in a text editor and do some detective work. Some people use Excel, OpenOffice or other spreadsheet software to parse the fields in the file. However, this is an advanced article, so we're going to assume you know how to do that!

- - [22/Jul/2013:20:07:48 +0100] "GET /special-events/action:month/cat_ids:9/tag_ids:37,26/ HTTP/1.0" 500 7309 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +"

Each log entry will look a little like the above. The last quote-delimited string is the User-Agent header:

"Mozilla/5.0 (compatible; Baiduspider/2.0; +"

The bit we are interested in is Baiduspider/2.0.
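
If you would rather not do the detective work by hand, the counting can be scripted. The following is a rough Python sketch that assumes your log uses the common "combined" layout seen in the entry above, where the User-Agent is the last quoted field on each line. The function names are hypothetical, and a User-Agent that itself contains escaped quote characters would not be extracted cleanly.

```python
import gzip
import re
from collections import Counter

# The User-Agent is taken to be the last double-quoted field on the line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def user_agent(line):
    """Return the last quoted field of a log line, or None if there is none."""
    match = LAST_QUOTED.search(line)
    return match.group(1) if match else None


def top_user_agents(lines, n=10):
    """Return the n most frequent User-Agent strings with their request counts."""
    counts = Counter()
    for line in lines:
        agent = user_agent(line)
        if agent is not None:
            counts[agent] += 1
    return counts.most_common(n)


def top_user_agents_in_gzip(path, n=10):
    """Read a .gz log as downloaded from cPanel without uncompressing it first."""
    with gzip.open(path, "rt", errors="replace") as log:
        return top_user_agents(log, n)
```

Calling top_user_agents_in_gzip with the path to your downloaded log returns a list of (User-Agent, count) pairs, busiest first. A spider that is thrashing your site will usually be at or near the top.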

We're not really interested in which version of Baiduspider is hitting us, so we're just going to block everything that matches Baiduspider in the User-Agent header. To do this, add the following to the top of your .htaccess file:

BrowserMatchNoCase baiduspider banned
Deny from env=banned

This would block all requests from the Baiduspider bot, as long as it sends its telltale User-Agent header.
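
Before relying on a pattern, it can help to check which User-Agent strings it would actually catch. BrowserMatchNoCase treats its first argument as a regular expression and matches it against the User-Agent header without regard to case. The Python sketch below approximates that behaviour for simple patterns such as the one above. The function name would_be_banned is hypothetical, and Python's regular expression engine is not identical to the one Apache uses, so treat this as a sanity check only.

```python
import re


def would_be_banned(user_agent, pattern="baiduspider"):
    """Approximate BrowserMatchNoCase: case-insensitive regex search of the header."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    # The first string comes from the log entry above; the second is a
    # made-up browser User-Agent used only for comparison.
    for agent in (
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +",
        "ExampleBrowser/1.0",
    ):
        print(agent, "->", "banned" if would_be_banned(agent) else "allowed")
```

Because the match is a substring search, a short or generic pattern can also catch legitimate visitors, so it is worth testing it against a few ordinary browser User-Agent strings from your own logs.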

You can read more about these directives in the Apache documentation for the mod_setenvif module.