Configure Wagtail robots.txt and Block Search Indexing

Creating a Dynamic robots.txt

Rather than just a static text file, you can take advantage of Django template markup to generate a dynamic robots.txt for your site.

Start off by creating a view to handle calls to robots.txt. I have a 'core' app on my sites where I put global things like this.

from django.views.generic import TemplateView

class RobotsView(TemplateView):

    content_type = 'text/plain'

    def get_template_names(self):
        return 'robots.txt'

In your urls.py, add the following to your urlpatterns:

from django.conf.urls import url

from core.views import RobotsView

urlpatterns = [
    url(r'^robots\.txt$', RobotsView.as_view(), name='robots'),
]
Now create the robots.txt in your root templates folder. At its most basic, it might look like this:

User-Agent: *
Disallow: /admin/

# Sitemap files
Sitemap: {{ request.scheme }}://{{ request.get_host }}/sitemap.xml

You could add additional logic, to test which bot is calling for example; anything you can do with Django templates and template tags can happen at this stage.
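As a sketch of that idea, here is a template that serves a block-everything robots.txt on a staging host. The hostname is an assumption, and it relies on the request context processor being enabled so that request is available in the template:

```
User-Agent: *
{% if request.get_host == "staging.example.com" %}
Disallow: /
{% else %}
Disallow: /admin/
{% endif %}

Sitemap: https://{{ request.get_host }}/sitemap.xml
```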

Blocking Search Indexing

What not to do

Initially, I'd added my accounts pages to the robots.txt disallow list as I didn't want these indexed.

User-Agent: *
Disallow: /admin/
Disallow: /accounts/
Disallow: /en/accounts/

Shortly afterwards, Google Search Console sent me the following message:

Warnings are suggestions for improvement. Some warnings can affect your appearance on Search; some might be reclassified as errors in the future. The following warnings were found on your site:

Indexed, though blocked by robots.txt

Further investigation brought me to this page with the following warning:

Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

The solution? Well, apparently the disallow rule is old-school. Google can index a URL it discovers through links without ever crawling the page, so the disallow rule doesn't stop the URL being indexed, and it then kicks up this warning. Allegedly, some search engines will just plain ignore robots.txt (nice to have consistency).

The modern and accepted method is to use either a robots meta tag or an HTTP response header.

Block Indexing with the robots meta tag

Since the robots meta tag resides in the global head template, it's straightforward to change the tag dynamically based on page properties or other testable criteria via a template tag function.

I thought about the paths I'd want to block from indexing, and realised it was any non-Wagtail page (the search and login account pages). Non-Wagtail pages won't have a self context variable associated with them. Since every page on the site uses the same head template, it was simple enough to add a template tag to deal with this:

from django.utils.safestring import mark_safe
from django import template

register = template.Library()

def get_context_var_or_none(context, key):
    """Return the named context variable, or None if it isn't set."""
    try:
        return context[key]
    except KeyError:
        return None

@register.simple_tag(takes_context=True)
def robots(context):
    # Non-Wagtail pages won't have a 'self' context variable
    page = get_context_var_or_none(context, 'self')
    if not page:
        return mark_safe('<meta name="robots" content="noindex">')
    return mark_safe(
        '<meta name="robots" content="index, follow, archive, imageindex, '
        'noodp, noydir, snippet, translate, max-snippet:-1, '
        'max-image-preview:large, max-video-preview:-1">'
    )

Anything being rendered without a self context variable will have the robots noindex meta tag added.

You'll need to use the mark_safe function, otherwise Django will escape the tag and render it as text instead of markup.
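To see what that escaping would do to the tag, here's the effect illustrated with Python's stdlib html.escape, which handles these characters the same way as Django's autoescaping:

```python
import html

# What Django's autoescaping would do to the tag without mark_safe:
tag = '<meta name="robots" content="noindex">'
print(html.escape(tag))
# &lt;meta name=&quot;robots&quot; content=&quot;noindex&quot;&gt;
```

The browser would then display that as literal text rather than treating it as a meta tag.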

I could have used something like the following without a template tag:

{% if not self %}<meta name="robots" content="noindex">{% endif %}

The problem with this is that Django writes an error to the logs every time a non-existent context variable is referenced, even from an 'if' test like this.

Instead, it just needs a call to your new template tag in the meta section of your head template to do the work.
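As a sketch, assuming the tag lives in a module loaded as core_tags (adjust to wherever you registered it), the head template only needs:

```
{% load core_tags %}
<head>
    {% robots %}
</head>
```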


If you need to index some of your non-Wagtail pages, just amend the template tag accordingly.

Block Indexing with HTTP Response Headers

It's also possible to add an X-Robots-Tag header with a value of noindex or none to selected paths in your server config files.

How you do that will depend on which server you're using.

On NGINX, that might be:

location /accounts {
    add_header X-Robots-Tag "noindex, follow" always;
    try_files $uri $uri/ /*;
}

Don't take the above as gospel, as I haven't used this method and in no way claim to be an NGINX expert, but it should at least point you in the right direction if you're going that route.
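If you'd rather keep this in Django than in the server config, the same header can be set from a small middleware. This is a hypothetical sketch, not something from the article; the path prefixes are assumptions to adjust for your own site:

```python
# Hypothetical sketch: set X-Robots-Tag from Django middleware instead
# of server config. NOINDEX_PREFIXES is an assumption -- list whatever
# paths you want kept out of the index.
NOINDEX_PREFIXES = ('/accounts/', '/en/accounts/', '/search/')

def x_robots_tag_middleware(get_response):
    """Standard Django middleware factory: wraps the view call."""
    def middleware(request):
        response = get_response(request)
        if request.path.startswith(NOINDEX_PREFIXES):
            # Django responses support dict-style header assignment
            response['X-Robots-Tag'] = 'noindex, follow'
        return response
    return middleware
```

Register it by its dotted path in the MIDDLEWARE setting, as with any middleware.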

Personally, I prefer the meta tag option: it's easy to test on the dev site and know what you're getting, and getting a meta tag wrong has no chance of breaking your site.

Use the rel="nofollow" Link Attribute

Frustratingly, Google will still index the URLs of pages you have marked with noindex if other pages link to them. Search Console will then tell you that you have page indexing errors on your site, even though it's behaving exactly as specified.

Make sure to add rel="nofollow" to those links to specifically tell search engines to ignore the link.
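For example, on an internal login link (the path is illustrative):

```html
<a href="/accounts/login/" rel="nofollow">Log in</a>
```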

Even Google documentation gives conflicting instructions here: "For links within your own site, use the robots.txt disallow rule."

Follow that link and find: "Google can't index the content of pages which are disallowed for crawling, but it may still index the URL and show it in search results without a snippet. Learn how to block indexing."

Follow that link, and you end up at that message that tells you not to use the robots disallow rule ...

For what it's worth, the rel="nofollow" works fine on links within your own site.
