Limiting what search engines can index using robots.txt
The robots.txt file is an essential tool used by webmasters to control how search engine spiders (or robots) interact with their websites. This file provides instructions to these bots on which parts of the site they are allowed or disallowed to access and index.
What the File Is Used For
The primary purpose of the robots.txt file is to manage the behavior of search engine crawlers. While most website owners appreciate being indexed by search engines like Google, some may prefer that certain pages or directories remain unindexed to protect sensitive information or reduce server load.
How to Allow User-Agents/Bots/AI Bots to Access the Site
To allow all user-agents (bots) to access your entire site, you would create a robots.txt file with the following content:
User-agent: *
Disallow:
This directive means that all bots (*) are permitted to crawl every part of the site because no paths are disallowed.
Blocking Malicious Bots
While legitimate bots like Googlebot should generally be allowed access, there are malicious bots that you may want to block outright. These bots often have harmful intent, including scraping content, overloading servers, or performing fraudulent activities.
Some examples of malicious bots or user-agents that you might consider blocking include:
- Suspicious user-agents: Mb2345Browser, LieBaoFast, MicroMessenger, and other user-agent strings that commonly appear on bad-bot blocklists because of aggressive or suspicious crawling behavior.
- Scrapers: These bots extract content without permission and can overload your server.
- Malicious bots involved in criminal activities such as fraud and data theft.
You can use tools like the Bad Bot Generator to generate a robots.txt file tailored to block specific malicious bots. This tool allows you to select the bots you wish to block and generates the appropriate robots.txt directives for you.
Example of blocking specific malicious bots in robots.txt:
User-agent: Mb2345Browser
Disallow: /
User-agent: LieBaoFast
Disallow: /
User-agent: MicroMessenger
Disallow: /
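Keep in mind that robots.txt is purely advisory, so genuinely malicious bots simply ignore it; enforcement has to happen server-side. The Python sketch below is a hypothetical illustration of that idea (the agent list and function name are made up for this example), not a ready-made middleware:
# robots.txt only advises well-behaved crawlers; a server-side check is needed
# to actually turn away known-bad user agents. Names here are illustrative.
BLOCKED_AGENTS = ("mb2345browser", "liebaofast", "micromessenger")

def is_blocked(user_agent_header: str) -> bool:
    # Case-insensitive substring match against the blocklist.
    ua = user_agent_header.lower()
    return any(bad in ua for bad in BLOCKED_AGENTS)

print(is_blocked("Mozilla/5.0 LieBaoFast/4.51.3"))            # True
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False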
How to Add a Sitemap
Including a sitemap in your robots.txt file helps search engines find and index your content more efficiently. To add a sitemap, include a line like this:
Sitemap: http://www.example.com/sitemap.xml
The Sitemap directive is independent of user-agent groups and can be placed anywhere in the file, though it is commonly put at the top or bottom. Here's an example of how to include a sitemap in your robots.txt file:
Sitemap: http://www.example.com/sitemap.xml
User-agent: *
Disallow: /private/
In this example, the sitemap URL is specified at the top of the file, making it easy for search engine bots to locate and use it for crawling your site.
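As a rough illustration of how that discovery works, here is a minimal Python sketch (not any particular crawler's implementation) that scans a robots.txt file for Sitemap entries; the file content is just the example above:
# Minimal sketch: pull Sitemap entries out of a robots.txt file. In practice
# a crawler would fetch your site's /robots.txt rather than use a string.
robots_txt = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /private/
"""

# Sitemap directives may appear anywhere in the file, so every line is scanned.
sitemaps = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)  # ['http://www.example.com/sitemap.xml']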
How to Add a Block to Stop URL Params from Being Spidered
If your website uses URL parameters (e.g., for tracking, sorting, or filtering), these can create duplicate content issues or unnecessary server load if indexed. To prevent search engines from crawling URLs with parameters, you can use the following directive in your robots.txt file:
User-agent: *
Disallow: /*?*
This directive tells all bots to avoid crawling any URL that contains a question mark (?), which typically indicates the presence of query parameters.
For example:
- A URL like http://www.example.com/product?id=123 would be blocked.
- A URL like http://www.example.com/about would still be accessible.
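To make the wildcard's behavior concrete, the Python sketch below translates the pattern into a regular expression and tests a few paths. It is a simplified illustration of the matching logic (it ignores the $ end-anchor and other refinements), not any search engine's actual implementation:
# Simplified sketch of how the wildcard rule behaves: '*' matches any sequence
# of characters, and a rule applies when the URL path starts with the pattern.
import re

def rule_matches(rule: str, path: str) -> bool:
    # Escape the rule, then turn the escaped '*' back into '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

RULE = "/*?*"  # the Disallow pattern from above

for url_path in ["/product?id=123", "/about", "/search?q=shoes"]:
    verdict = "blocked" if rule_matches(RULE, url_path) else "allowed"
    print(f"{url_path} -> {verdict}")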
How to Disallow Any Indexing - i.e., a Complete Block
If you wish to block all search engine bots from accessing and indexing any part of your site, use the following directives:
User-agent: *
Disallow: /
These directives instruct all compliant bots to refrain from accessing any content on the site.
Best Practices for robots.txt
- Use Wildcards Carefully: While wildcards like * are useful for broad directives, they should be used cautiously to avoid accidentally blocking important content.
- Test Your File: Use tools like Google Search Console to test your robots.txt file and ensure it behaves as expected; a quick local check is sketched after this list.
- Combine with Other Security Measures: Since robots.txt is not enforceable, pair it with server-side measures like .htaccess rules or firewalls to block malicious bots effectively.
- Regularly Update: Periodically review and update your robots.txt file to reflect changes in your site structure or bot behavior.
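For that quick local check, the sketch below uses Python's standard urllib.robotparser module with an illustrative file and user-agent names. Note that this parser follows the original robots.txt conventions and does not evaluate wildcard rules such as /*?*, so those still need a tool like Google Search Console:
# Minimal sketch: check which URLs a given user-agent may fetch, using only
# the standard library. urllib.robotparser ignores wildcard rules, so patterns
# like /*?* must be verified with other tools.
import urllib.robotparser

robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

checks = [
    ("Googlebot", "http://www.example.com/about"),
    ("Googlebot", "http://www.example.com/private/report.html"),
    ("BadBot", "http://www.example.com/about"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {url} -> {verdict}")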
By following these guidelines, you can create a robust robots.txt file that balances accessibility for legitimate bots with protection against malicious ones.
Example robots.txt
# Sitemap inclusion for better indexing
Sitemap: http://www.example.com/sitemap.xml
# Block common AI bots and malicious bots
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ByteSpider
Disallow: /
User-agent: CommonCrawl
Disallow: /
User-agent: Mb2345Browser
Disallow: /
User-agent: LieBaoFast
Disallow: /
User-agent: MicroMessenger
Disallow: /
# Allow legitimate search engine bots (Google, Bing, etc.)
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: FacebookExternalHit
Disallow:
User-agent: Twitterbot
Disallow:
# Allow image bots for social sharing
User-agent: Googlebot-Image
Disallow:
User-agent: Bingbot-Image
Disallow:
# Rules for all remaining bots: allow media and template folders,
# and block unnecessary crawling
User-agent: *
Allow: /media/
Allow: /templates/
Disallow: /*?* # Block URLs with parameters
Disallow: /private/
Explanation
- Sitemap: The Sitemap directive helps search engines locate your sitemap for efficient crawling.
- Blocking AI and Malicious Bots: Specific user-agents like GPTBot, CCBot, and others are blocked to prevent scraping or malicious activity.
- Allowing Legitimate Bots: Search engines like Google (Googlebot), Bing (Bingbot), and social platforms like Facebook (FacebookExternalHit) and Twitter (Twitterbot) are explicitly allowed to crawl the site.
- Allow Image Bots: Image-specific bots like Googlebot-Image and Bingbot-Image are allowed for social sharing purposes.
- Allow Media & Template Folders: Directories like /media/ and /templates/ are explicitly allowed for access by all bots.
- Disallow Unnecessary Crawling: URLs with parameters (/*?*) and private directories (/private/) are blocked to reduce server load and avoid duplicate content issues.