Configure Wagtail robots.txt and Block Search Indexing

Introduction: Understanding Robots.txt for Effective Crawling

The main reason to maintain a robots.txt file for your website is to keep it from being overloaded by crawler requests.

Crawlers visit your website and add URLs to the crawl queue, both newly discovered URLs and ones they already know. A crawler first looks for the robots.txt file in your website's root directory. If the file is not present, it will crawl your entire site. If it exists, a well-behaved crawler will crawl your website according to the directives you set, with the caveats described in the next section.

Blocking Search Engine Indexing

A common misconception is that robots.txt directives can be used to keep pages out of Google search results. Google can still index your content if other signals point to it, such as links from other websites or even internal links.

Robots.txt is not a means to prevent Google from indexing your content.

Your website may suffer greatly if your robots.txt file is not configured correctly: you could inadvertently instruct crawlers to stay away from pages you want crawled. The problem gets worse on really big websites, where a single bad rule can deny crawlers access to whole sections of important pages.

Additionally, not every crawler will follow the instructions in your robots.txt file. Most reputable crawlers avoid pages that are restricted by robots.txt, but malicious bots can simply ignore it. Therefore, do not rely solely on robots.txt to protect sensitive sections of your website.

Creating a Dynamic robots.txt with Django

Rather than using a static text file, you can take advantage of Django template markup to generate a dynamic robots.txt for your site.

Start by creating a view to handle requests for robots.txt. I keep a 'core' app in my sites for global things like this.

core/views.py

        
from django.views.generic import TemplateView


class RobotsView(TemplateView):
    # Render the 'robots.txt' template and serve it as plain text
    # rather than the default text/html.
    template_name = 'robots.txt'
    content_type = 'text/plain'

In your urls.py, add the following to your urlpatterns:

urls.py

        
from django.urls import path

from core.views import RobotsView

urlpatterns = [
    ...
    # Crawlers request /robots.txt at the site root.
    path('robots.txt', RobotsView.as_view(), name='robots'),
    ...
]

Now create the robots.txt template in your root templates folder. At its most basic, it might look like this:

robots.txt

        
User-Agent: *
Disallow: /admin/

# Sitemap files
Sitemap: {{ request.site.root_url }}/sitemap.xml

You could add further logic, for example to test which bot is making the request. Anything you can do with Django templating and template tags can be added at this stage.
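As a rough sketch of the bot test mentioned above, the view could pass a little extra context to the template. The helper name robots_context, the KNOWN_BOT_TOKENS list and the context keys are all assumptions made for illustration, not part of Django or Wagtail. The helper also builds the sitemap URL from a root URL passed in by the view, which may be useful on recent Wagtail versions where request.site is no longer set and the site is looked up with Site.find_for_request(request) instead.

```python
# Hypothetical helper: these names are illustrative, not a Django or Wagtail API.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "duckduckbot")


def robots_context(user_agent, root_url):
    """Return extra template context for rendering robots.txt."""
    agent = (user_agent or "").lower()
    # First known token found in the User-Agent header, or None.
    bot_name = next((token for token in KNOWN_BOT_TOKENS if token in agent), None)
    return {
        "bot_name": bot_name,
        "sitemap_url": root_url.rstrip("/") + "/sitemap.xml",
    }
```

In RobotsView.get_context_data you could then merge in robots_context(self.request.META.get("HTTP_USER_AGENT", ""), root_url) and use {{ sitemap_url }} or {% if bot_name == "bingbot" %} in the template. Bear in mind that the User-Agent header is trivially spoofed, so treat this as a convenience rather than a control.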

Blocking Search Indexing

What not to do

Initially, I'd added my accounts pages to the robots.txt disallow list as I didn't want these indexed.

        
User-Agent: *
Disallow: /admin/
Disallow: /accounts/
Disallow: /en/accounts/

Shortly afterwards, Google Search Console sent me the following message:

Warnings are suggestions for improvement. Some warnings can affect your appearance on Search; some might be reclassified as errors in the future. The following warnings were found on your site:

Indexed, though blocked by robots.txt

Further investigation brought me to this page with the following warning:

Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

The solution? Well, apparently the disallow is old-school. A disallow rule only stops crawling, not indexing: Google can index a URL it discovers through links without ever fetching the page, and because the page is blocked it never gets to see a noindex directive, so it kicks up this warning. Allegedly, some search engines will just plain ignore robots.txt (nice to have consistency).

The modern and accepted method is to use either a robots meta tag or an HTTP response header.

Block Indexing with the robots meta tag

Since the robots meta tag lives in the shared <head> template, it's straightforward to set it dynamically from a template tag, based on page properties or anything else you can test at render time.

I thought about the paths I'd want to block from indexing, and realised it was any non-Wagtail page (the search and login account pages). Non-Wagtail pages won't have a self context variable associated with them. Since every page on the site uses the same head template, it was simple enough to add a template tag to deal with this:

        
from django import template
from django.utils.safestring import mark_safe

register = template.Library()

# Directives for pages that should be indexed.
INDEX_CONTENT = (
    "index, follow, archive, imageindex, noodp, noydir, snippet, "
    "translate, max-snippet:-1, max-image-preview:large, max-video-preview:-1"
)


@register.simple_tag(takes_context=True)
def robots(context):
    # Wagtail puts the current page in the context as 'self'.
    # Context.get() returns None for non-Wagtail views instead of raising.
    page = context.get('self')
    if not page:
        return mark_safe('<meta name="robots" content="noindex">')
    return mark_safe(f'<meta name="robots" content="{INDEX_CONTENT}">')

Anything being rendered without a self context variable will have the robots noindex meta tag added.

You'll need the mark_safe function here; otherwise Django escapes the string and renders it as visible text instead of markup.

I could have used something like the following without a template tag:

        
{% if not self %}<meta name="robots" content="noindex">{% endif %}

The problem with this is that Django writes a debug message, complete with traceback, to the django.template logger every time a non-existent context variable is referenced, even from an 'if' test like this.

Instead, all it takes is a call to your new template tag in the <head> section of your head template (after loading your tag library with {% load %}):

        
{% robots %}

If you need to index some of your non-Wagtail pages, just amend the template tag accordingly.
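One way to amend the tag, sketched under assumptions: keep the decision in a small plain function that the tag can call, with an allow-list of non-Wagtail paths that should still be indexed. The function name robots_content and the /contact/ prefix are hypothetical, and the simplified "index, follow" value stands in for the longer directive list used above.

```python
# Hypothetical allow-list of non-Wagtail paths that should still be indexed.
INDEXABLE_PATH_PREFIXES = ("/contact/",)


def robots_content(page, path, indexable_prefixes=INDEXABLE_PATH_PREFIXES):
    """Return the value for the robots meta tag's content attribute."""
    if page:
        # Wagtail pages keep the default behaviour.
        return "index, follow"
    if path.startswith(tuple(indexable_prefixes)):
        # Explicitly allowed non-Wagtail view.
        return "index, follow"
    return "noindex"
```

Inside the template tag, the path could come from context['request'].path, which assumes the request context processor is enabled.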

Block Indexing with HTTP Response Headers

It's also possible to add an X-Robots-Tag header with a value of noindex or none to selected paths in your server config files.

How you do that will depend on which server you're using.

On NGINX, that might be:

        
location /accounts {
    add_header X-Robots-Tag "noindex, follow" always; 
    try_files $uri $uri/ /*;
}

Don't take the above as gospel, as I haven't used this method and in no way claim to be an NGINX expert. But it at least points you in the right direction if you're going that route.
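If you would rather not touch the server config, the same header can be set from the application. What follows is a minimal sketch of a Django-style middleware, not something taken from my own sites: the class name and the path prefixes are assumptions, and it relies only on the standard middleware call signature and on Django responses accepting header assignment by key.

```python
class XRobotsTagMiddleware:
    """Add an X-Robots-Tag header to responses under selected path prefixes."""

    # Hypothetical prefixes; adjust to the paths you want kept out of the index.
    noindex_prefixes = ("/accounts/", "/en/accounts/")

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        if request.path.startswith(self.noindex_prefixes):
            # Django responses support setting headers like dictionary items.
            response["X-Robots-Tag"] = "noindex"
        return response
```

To try it, add the class's dotted path to MIDDLEWARE in your settings. As with the meta tag, the affected paths must not be disallowed in robots.txt, or the crawler will never see the header.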

Personally, I prefer the meta tag option - it's easy to test on the dev site and know what you're getting. Getting a meta tag wrong also has no chance of breaking your site.

Use the rel="nofollow" Link Attribute

Frustratingly, Google will still index the URLs for pages that you have marked with noindex if you have links pointing to those pages. Search Console will then tell you that you have page indexing errors on your site even though it's behaving exactly as specified.

Make sure to add rel="nofollow" to those links to specifically tell search engines to ignore the link.

Even Google documentation gives conflicting instructions here: "For links within your own site, use the robots.txt disallow rule."

Follow that link and find: "Google can't index the content of pages which are disallowed for crawling, but it may still index the URL and show it in search results without a snippet. Learn how to block indexing."

Follow that link, and you end up at that message that tells you not to use the robots disallow rule ...

For what it's worth, the rel="nofollow" works fine on links within your own site.
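To find links that still need the attribute, a rendered page can be scanned with Python's standard html.parser module. This is only a sketch: the class and function names and the default /accounts/ prefix are assumptions chosen to match the paths discussed earlier.

```python
from html.parser import HTMLParser


class NofollowAudit(HTMLParser):
    """Collect links to the given path prefixes that lack rel="nofollow"."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = tuple(prefixes)
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if href.startswith(self.prefixes) and "nofollow" not in rel_tokens:
            self.missing.append(href)


def links_missing_nofollow(html, prefixes=("/accounts/",)):
    """Return hrefs under the prefixes whose <a> tag has no rel="nofollow"."""
    audit = NofollowAudit(prefixes)
    audit.feed(html)
    return audit.missing
```

Feeding it the HTML of a rendered page returns the hrefs that point at the noindexed paths without rel="nofollow".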

Testing Your robots.txt

There are quite a few online tools available for testing robots.txt configuration. Logeix has a good one: it can either fetch your live file or take rules you paste in.

The tool will validate your syntax and allow you to test URLs. Any rule that applies to the URL will be highlighted. There is also an option to test against various user agents.
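For a quick local check alongside an online tool, Python's standard urllib.robotparser module can evaluate rules against URLs and user agents. The sketch below uses example rules and a hypothetical is_allowed helper. Note that the standard library parser does simple prefix matching and does not implement Google's wildcard extensions, so results for patterns using * or $ in paths may differ from what Google does.

```python
from urllib.robotparser import RobotFileParser

# Example rules, similar to the ones shown earlier in this article.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /admin/
Disallow: /accounts/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/admin/login/") shows whether a given rule set blocks that URL for that agent.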

Robots.txt Testing Tool: Validate your Robots.txt File - Logeix


Conclusion

In this tutorial, we delved into crafting a dynamic robots.txt for your Django site, transforming it from a static file to a template-driven, adaptable tool. Key takeaways include:

Creating Dynamic robots.txt: Utilize Django template markup to render a dynamic robots.txt, injecting flexibility into your site's directives.
View and URL Configuration: Establish a dedicated view for handling robots.txt calls, seamlessly integrating it into your site's URL patterns.
Effective Disallow Strategies: Learn the nuances of blocking search indexing efficiently. Explore the pitfalls of using robots.txt disallow and embrace modern alternatives like meta tags and HTTP response headers.
Meta Tags for Noindexing: Implement the robots meta tag with a custom template tag function, dynamically adjusting based on page properties.
HTTP Response Headers: Explore the option of adding X-Robots-Tag in your server config files for specific paths.
Link Attributes: Understand the importance of rel="nofollow" in links to prevent search engines from indexing specific pages.

By mastering these techniques, you can effectively manage how search engines interact with your site, ensuring optimal visibility and adherence to your indexing preferences. Boost your SEO strategy with informed choices in robots.txt configuration.