How to block bad spiders from wasting bandwidth
Search engines use technology known as spiders to search the web (nice, huh?). A spider is an agent (also called a bot, short for robot) that connects to your website and downloads a copy of all of your pages (or tries to) in order to populate the search engine it is working for. However, spiders are supposed to obey certain rules, and they are definitely not supposed to thrash your website to the point that it causes a denial of service or uses up all of your bandwidth.
By adding special instructions to a file called .htaccess (the full stop in front of it is intentional) you can instruct your web server to deny requests from specific spiders.
Solution 1 - ban by IP address
If a file does not already exist at public_html/.htaccess you can create an empty one.
Add this to the top of the file, replacing x.x.x.x with the IP address of the bad spider bot.
# Evaluate allow rules first, then deny rules, so the deny line wins
order allow,deny
allow from all
deny from x.x.x.x
Very often bots use a range of IP addresses. For example, Baiduspider, a Chinese spider which causes many of our customers to experience problems, appears to use a range of addresses from 184.108.40.206 to 220.127.116.11 to spider sites in the UK. To block a whole range, you can add deny lines in CIDR notation, where /24 covers the 256 addresses that share the same first three numbers:
order allow,deny
allow from all
deny from 18.104.22.168/24
deny from 22.214.171.124/24
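To see what a /24 suffix actually covers before adding it to a deny line, a minimal Python sketch using the standard ipaddress module may help. The addresses below are documentation placeholders (203.0.113.0/24 and 198.51.100.0/24), not real Baiduspider addresses, and the helper names describe_range and is_blocked are made up for this illustration. It only does the arithmetic; it does not talk to Apache.

```python
import ipaddress


def describe_range(cidr):
    """Return (first address, last address, size) for a CIDR block.

    strict=False tolerates an address with host bits set, such as
    203.0.113.57/24, and treats it as the enclosing /24 block.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    return str(network[0]), str(network[-1]), network.num_addresses


def is_blocked(ip, denied_ranges):
    """Return True if ip falls inside any of the denied CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in denied_ranges
    )


if __name__ == "__main__":
    denied = ["203.0.113.57/24"]  # placeholder range, an assumption
    print(describe_range(denied[0]))
    print(is_blocked("203.0.113.200", denied))
    print(is_blocked("198.51.100.7", denied))
```

If the addresses you see in your logs fall outside the block this reports, the deny line will not catch them and you need a wider or additional range.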
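If you only want to apply these rules to a particular directory within your website, then you can put them in a .htaccess file inside that directory instead, for example public_html/documents/notforbots/.htaccess:
order allow,deny
allow from all
deny from 126.96.36.199
This would block that address from being able to access anything under http://yourwebsite.com/documents/notforbots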
You can read more about these directives in the Apache 2.2 documentation for mod_authz_host (known as mod_access in earlier versions).
Solution 2 - ban by User Agent
If you know how the spider is identifying itself, then you can block requests on the basis of the User-Agent HTTP request header.
So, how do you find out what User-Agent is hitting your site so hard? Look in your raw Apache logs, which you will find in the Logs section of cPanel.
There you can download the logs that have been collected so far today by clicking on the domain in question. Once you have downloaded and uncompressed the .gz file, you will have to load it up in a text editor and do some detective work. Some people use Excel or OpenOffice or other spreadsheet software to parse the fields in the file. However, this is an advanced article so we're going to assume you know how to do that!
184.108.40.206 - - [22/Jul/2013:20:07:48 +0100] "GET /special-events/action:month/cat_ids:9/tag_ids:37,26/ HTTP/1.0" 500 7309 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
Each log entry will look a little like the above. The last quote-delimited string is the User-Agent header:
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
The bit we are interested in is Baiduspider/2.0.
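If you would rather script the detective work than read the log by eye, here is a rough Python sketch that counts how often each User-Agent (or IP address) appears. It assumes your log uses the same layout as the sample line above; the regular expression, the helper names (parse_line, top_values, read_lines) and the test data are illustrative, and requests containing escaped quotes will simply be skipped.

```python
import gzip
import re
from collections import Counter

# Layout assumed from the sample line above:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

SAMPLE_LINE = (
    '184.108.40.206 - - [22/Jul/2013:20:07:48 +0100] '
    '"GET /special-events/action:month/cat_ids:9/tag_ids:37,26/ HTTP/1.0" '
    '500 7309 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def top_values(lines, field, limit=10):
    """Count the most frequent values of one field, e.g. "agent" or "ip"."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry[field]] += 1
    return counts.most_common(limit)


def read_lines(path):
    """Yield lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


if __name__ == "__main__":
    demo = [SAMPLE_LINE, SAMPLE_LINE, "not a log line"]
    for agent, hits in top_values(demo, "agent"):
        print(hits, agent)
    for ip, hits in top_values(demo, "ip"):
        print(hits, ip)
```

To run it against a real download, pass read_lines("yourlogfile.gz") to top_values in place of the demo list, substituting whatever filename cPanel gave you. The User-Agent or IP address at the top of the list is usually the one to investigate first.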
We're not really interested in which version of Baiduspider is hitting us, so we're just going to block everything that matches Baiduspider in the User-Agent header. To do this, we would add this to the top of our .htaccess file:
BrowserMatchNoCase baiduspider banned
Deny from env=banned
This would block all requests from the Baiduspider bot, as long as it issued its telltale User-Agent header.
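Before relying on a pattern like this, it can be worth checking which User-Agent strings it would catch. The sketch below approximates the matching in Python as a case-insensitive regular expression search anywhere in the header. That is an approximation of how BrowserMatchNoCase behaves rather than Apache's own code, the function name would_be_banned is invented, and the second and third User-Agent strings are made-up test values.

```python
import re


def would_be_banned(user_agent, pattern="baiduspider"):
    """Approximate the rule: unanchored, case-insensitive regex search."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    agents = [
        # Taken from the log line in this article
        "Mozilla/5.0 (compatible; Baiduspider/2.0; "
        "+http://www.baidu.com/search/spider.html)",
        # Illustrative strings, not taken from real logs
        "BAIDUSPIDER-test/1.0",
        "Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0",
    ]
    for agent in agents:
        print(would_be_banned(agent), agent)
```

A pattern that is too short or too generic may also catch ordinary browsers, so it is worth running a few User-Agent strings from your own logs through a check like this first.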
You can read more about these directives in the Apache documentation for the mod_setenvif module.