Wagtail - Configure the robots.txt and Block Search Indexing (the correct way)

Rather than just a static text file, you can take advantage of Django template markup to generate a dynamic robots.txt for your site.

Start off by creating a view to handle requests for robots.txt. I have a 'core' app on my sites where I put global things like this.

#core/views.py
from django.views.generic import TemplateView

class RobotsView(TemplateView):

    content_type = 'text/plain'

    def get_template_names(self):
        return 'robots.txt'

In your urls.py, add the following to your urlpatterns:

from django.conf.urls import url
from core.views import RobotsView

urlpatterns = [
    ...
    url(r'^robots\.txt$', RobotsView.as_view(), name='robots'),
    ...
]
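
If you're on Django 2.0 or later, path() does the same job without the regex:

from django.urls import path
from core.views import RobotsView

urlpatterns = [
    ...
    path('robots.txt', RobotsView.as_view(), name='robots'),
    ...
]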

Now create robots.txt in your root templates folder. At its most basic, it might look like this:

User-Agent: *
Disallow: /admin/

# Sitemap files
Sitemap: {{ request.site.root_url }}/sitemap.xml
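
Note that request.site comes from Wagtail's SiteMiddleware, which newer Wagtail releases have deprecated. If request.site isn't available on your version, a sketch of the equivalent using the wagtail_site tag from wagtailcore_tags:

{% load wagtailcore_tags %}{% wagtail_site as current_site %}
User-Agent: *
Disallow: /admin/

# Sitemap files
Sitemap: {{ current_site.root_url }}/sitemap.xml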

You could add additional logic, for example to test which bot is making the request; anything you can do with Django templating and custom template tags can be added at this stage.
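
As an example, you might want a staging deployment to block everything. A minimal sketch, assuming a ROBOTS_BLOCK_ALL setting that you define yourself and pass into the template from the view:

#core/views.py
from django.conf import settings
from django.views.generic import TemplateView

class RobotsView(TemplateView):

    content_type = 'text/plain'

    def get_template_names(self):
        return 'robots.txt'

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # assumption: ROBOTS_BLOCK_ALL is a setting you add for staging/dev sites
        context['block_all'] = getattr(settings, 'ROBOTS_BLOCK_ALL', False)
        return context

The robots.txt template then branches on that flag:

{% if block_all %}
# Block everything on non-production deployments
User-Agent: *
Disallow: /
{% else %}
User-Agent: *
Disallow: /admin/

# Sitemap files
Sitemap: {{ request.site.root_url }}/sitemap.xml
{% endif %}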

Blocking Search Indexing

Initially, I'd added my accounts pages to the robots.txt disallow list as I didn't want these indexed.

User-Agent: *
Disallow: /admin/
Disallow: /accounts/
Disallow: /en/accounts/

Shortly afterwards, Google Search Console sent me the following message:

Warnings are suggestions for improvement. Some warnings can affect your appearance on Search; some might be reclassified as errors in the future. The following warnings were found on your site:
Indexed, though blocked by robots.txt

Further investigation brought me to this page with the following warning:

Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

The solution? Well, apparently Disallow on its own is old-school for this purpose. robots.txt only blocks crawling, not indexing, so if other pages link to a disallowed URL, Google can still index it without ever crawling it, which is exactly what triggers the warning. And reportedly some search engines will just plain ignore robots.txt (nice to have consistency).

The modern and accepted method is to use either a robots meta tag or an HTTP response header.

Block Indexing with the robots meta tag

Since the robots meta tag lives in the shared head template, it's straightforward to vary it dynamically based on page properties or any other criteria you can test at render time, using a template tag function.

I thought about the paths I'd want to block from indexing, and realised it was any non-Wagtail page (the search and login account pages). Non-Wagtail pages won't have a self context variable associated with them. Since every page on the site uses the same head template, it was simple enough to add a template tag to deal with this:

from django.utils.safestring import mark_safe
from django import template

register = template.Library()

@register.simple_tag(takes_context=True)
def robots(context):
    # Wagtail pages are rendered with the page object available as 'self';
    # anything rendered without it (search, account views, etc.) gets noindex.
    page = context.get('self')
    if not page:
        return mark_safe('<meta name="robots" content="noindex">')
    return mark_safe('<meta name="robots" content="index, follow, archive, imageindex, odp, snippet, translate, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />')

Anything being rendered without a self context variable will have the robots noindex meta tag added.

You'll need to use the mark_safe function, otherwise Django will escape the HTML and output the literal text instead of the tag.

I could have used something like the following without a template tag:

{% if not self %}<meta name="robots" content="noindex">{% endif %}

The problem with this is that Django writes an error to the logs every time a non-existent context variable is referenced, even from an 'if' test like this.

Instead, once the tag library is loaded, it just needs a call to your new template tag in the meta section of your head template to do the work:

{% robots %}
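
For reference, a sketch of how that might sit in a shared base template. The library name core_tags is an assumption; use whatever name you registered the tag under:

{# base.html - head section #}
{% load core_tags %}
<head>
    <meta charset="utf-8">
    {% robots %}
    {# other meta tags, title, stylesheets #}
</head>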

If you need to index some of your non-Wagtail pages, just amend the template tag accordingly.
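
For example, the tag could check the request path against a small allow-list before falling back to noindex. A sketch, with the paths purely illustrative:

@register.simple_tag(takes_context=True)
def robots(context):
    page = context.get('self')
    request = context.get('request')
    # assumption: non-Wagtail paths you still want indexed
    indexable_paths = ('/search/',)
    if not page and not (request and request.path in indexable_paths):
        return mark_safe('<meta name="robots" content="noindex">')
    return mark_safe('<meta name="robots" content="index, follow">')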

Block Indexing with HTTP Response Headers

It's also possible to add an X-Robots-Tag header with a value of noindex or none for selected paths in your server config files.

How you do that will depend on which server you're using.

On NGINX, that might be:

location /accounts {
    add_header X-Robots-Tag "noindex, follow" always;
    # existing try_files / proxy directives for this path go here
}

Don't take the above as gospel, as I haven't used this method and in no way claim to be an NGINX expert, but it should at least point you in the right direction if you're going that route.
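
Alternatively, if you'd rather keep this logic in Django than in the server config, the same header can be set from a small middleware. A sketch, with the path prefixes as placeholders to adjust for your own URLs:

#core/middleware.py
class NoIndexMiddleware:
    """Add an X-Robots-Tag: noindex header to responses for selected path prefixes."""

    # assumption: adjust these prefixes to match your URL structure
    NOINDEX_PREFIXES = ('/accounts/', '/en/accounts/')

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        if request.path.startswith(self.NOINDEX_PREFIXES):
            response['X-Robots-Tag'] = 'noindex'
        return response

Remember to add it to MIDDLEWARE in settings.py for it to take effect.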

Personally, I prefer the meta tag option: it's easy to test on the dev site and know exactly what you're getting, and getting a meta tag wrong has no chance of breaking your site.


If you have any comments, questions or suggestions on the above, feel free to leave those in the comment section below.

 