Limiting what search engines can index using robots.txt
The robots.txt file is an essential tool used by webmasters to control how search engine spiders (or robots) interact with their websites. This file provides instructions to these bots on which parts of the site they are allowed or disallowed to access and index.
What the File Is Used For
The primary purpose of the robots.txt file is to manage the behavior of search engine crawlers. While most website owners appreciate being indexed by search engines like Google, some may prefer that certain pages or directories remain unindexed to protect sensitive information or reduce server load.
How to Allow User-Agents/Bots/AI Bots to Access the Site
To allow all user-agents (bots) to access your entire site, you would create a robots.txt file with the following content:
User-agent: *
Disallow:
This directive means that all bots (*) are permitted to crawl every part of the site because no paths are disallowed.
Blocking Malicious Bots
While legitimate bots like Googlebot should generally be allowed access, there are malicious bots that you may want to block outright. These bots often have harmful intent, including scraping content, overloading servers, or performing fraudulent activities.
Some examples of malicious bots or user-agents that you might consider blocking include:
- Suspicious user-agents: Mb2345Browser, LieBaoFast, MicroMessenger, and other user-agent strings that commonly appear on bad-bot blocklists because of aggressive or suspicious crawling behavior.
- Scrapers: These bots extract content without permission and can overload your server.
- Malicious bots involved in criminal activities such as fraud and data theft.
You can use tools like the Bad Bot Generator to generate a robots.txt file tailored to block specific malicious bots. This tool allows you to select the bots you wish to block and generates the appropriate robots.txt directives for you.
Example of blocking specific malicious bots in robots.txt:
User-agent: Mb2345Browser
Disallow: /
User-agent: LieBaoFast
Disallow: /
User-agent: MicroMessenger
Disallow: /
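Keep in mind that robots.txt is purely advisory, so genuinely malicious bots simply ignore it; enforcement has to happen server-side. The Python sketch below is a hypothetical illustration of that idea (the agent list and function name are made up for this example), not a ready-made middleware:
# robots.txt only advises well-behaved crawlers; a server-side check is needed
# to actually turn away known-bad user agents. Names here are illustrative.
BLOCKED_AGENTS = ("mb2345browser", "liebaofast", "micromessenger")

def is_blocked(user_agent_header: str) -> bool:
    # Case-insensitive substring match against the blocklist.
    ua = user_agent_header.lower()
    return any(bad in ua for bad in BLOCKED_AGENTS)

print(is_blocked("Mozilla/5.0 LieBaoFast/4.51.3"))            # True
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False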
How to Add a Sitemap
Including a sitemap in your robots.txt file helps search engines find and index your content more efficiently. To add a sitemap, include a line like this:
Sitemap: http://www.example.com/sitemap.xml
The Sitemap directive is independent of user-agent groups and can be placed anywhere in the file, though it is commonly put at the top or bottom. Here's an example of how to include a sitemap in your robots.txt file:
Sitemap: http://www.example.com/sitemap.xml
User-agent: *
Disallow: /private/
In this example, the sitemap URL is specified at the top of the file, making it easy for search engine bots to locate and use it for crawling your site.
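As a rough illustration of how that discovery works, here is a minimal Python sketch (not any particular crawler's implementation) that scans a robots.txt file for Sitemap entries; the file content is just the example above:
# Minimal sketch: pull Sitemap entries out of a robots.txt file. In practice
# a crawler would fetch your site's /robots.txt rather than use a string.
robots_txt = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /private/
"""

# Sitemap directives may appear anywhere in the file, so every line is scanned.
sitemaps = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)  # ['http://www.example.com/sitemap.xml']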
How to Add a Block to Stop URL Params from Being Spidered
If your website uses URL parameters (e.g., for tracking, sorting, or filtering), these can create duplicate content issues or unnecessary server load if indexed. To prevent search engines from crawling URLs with parameters, you can use the following directive in your robots.txt file:
User-agent: *
Disallow: /*?*
This directive tells all bots to avoid crawling any URL that contains a question mark (?), which typically indicates the presence of query parameters.
For example:
- A URL like http://www.example.com/product?id=123 would be blocked.
- A URL like http://www.example.com/about would still be accessible.
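To make the wildcard's behavior concrete, the Python sketch below translates the pattern into a regular expression and tests a few paths. It is a simplified illustration of the matching logic (it ignores the $ end-anchor and other refinements), not any search engine's actual implementation:
# Simplified sketch of how the wildcard rule behaves: '*' matches any sequence
# of characters, and a rule applies when the URL path starts with the pattern.
import re

def rule_matches(rule: str, path: str) -> bool:
    # Escape the rule, then turn the escaped '*' back into '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

RULE = "/*?*"  # the Disallow pattern from above

for url_path in ["/product?id=123", "/about", "/search?q=shoes"]:
    verdict = "blocked" if rule_matches(RULE, url_path) else "allowed"
    print(f"{url_path} -> {verdict}")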
How to Disallow Any Indexing - i.e., a Complete Block
If you wish to block all search engine bots from accessing and indexing any part of your site, use the following directives:
User-agent: *
Disallow: /
These directives instruct all compliant bots to refrain from accessing any content on the site.
Best Practices for robots.txt
- Use Wildcards Carefully: While wildcards like * are useful for broad directives, they should be used cautiously to avoid accidentally blocking important content.
- Test Your File: Use tools like Google Search Console to test your robots.txt file and ensure it behaves as expected; a quick local check is sketched after this list.
- Combine with Other Security Measures: Since robots.txt is not enforceable, pair it with server-side measures like .htaccess rules or firewalls to block malicious bots effectively.
- Regularly Update: Periodically review and update your robots.txt file to reflect changes in your site structure or bot behavior.
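For that quick local check, the sketch below uses Python's standard urllib.robotparser module with an illustrative file and user-agent names. Note that this parser follows the original robots.txt conventions and does not evaluate wildcard rules such as /*?*, so those still need a tool like Google Search Console:
# Minimal sketch: check which URLs a given user-agent may fetch, using only
# the standard library. urllib.robotparser ignores wildcard rules, so patterns
# like /*?* must be verified with other tools.
import urllib.robotparser

robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

checks = [
    ("Googlebot", "http://www.example.com/about"),
    ("Googlebot", "http://www.example.com/private/report.html"),
    ("BadBot", "http://www.example.com/about"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {url} -> {verdict}")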
By following these guidelines, you can create a robust robots.txt file that balances accessibility for legitimate bots with protection against malicious ones.
Example robots.txt
# Sitemap inclusion for better indexing
Sitemap: http://www.example.com/sitemap.xml
# Block common AI bots and malicious bots
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ByteSpider
Disallow: /
User-agent: CommonCrawl
Disallow: /
User-agent: Mb2345Browser
Disallow: /
User-agent: LieBaoFast
Disallow: /
User-agent: MicroMessenger
Disallow: /
# Allow legitimate search engine bots (Google, Bing, etc.)
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: FacebookExternalHit
Disallow:
User-agent: Twitterbot
Disallow:
# Allow image bots for social sharing
User-agent: Googlebot-Image
Disallow:
User-agent: Bingbot-Image
Disallow:
# Rules for all remaining bots: allow media and template folders,
# and block unnecessary crawling
User-agent: *
Allow: /media/
Allow: /templates/
Disallow: /*?* # Block URLs with parameters
Disallow: /private/
Explanation
- Sitemap: The Sitemap directive helps search engines locate your sitemap for efficient crawling.
- Blocking AI and Malicious Bots: Specific user-agents like GPTBot, CCBot, and others are blocked to prevent scraping or malicious activity.
- Allowing Legitimate Bots: Search engines like Google (Googlebot), Bing (Bingbot), and social platforms like Facebook (FacebookExternalHit) and Twitter (Twitterbot) are explicitly allowed to crawl the site.
- Allow Image Bots: Image-specific bots like Googlebot-Image and Bingbot-Image are allowed for social sharing purposes.
- Allow Media & Template Folders: Directories like /media/ and /templates/ are explicitly allowed for access by all bots.
- Disallow Unnecessary Crawling: URLs with parameters (/*?*) and private directories (/private/) are blocked to reduce server load and avoid duplicate content issues.