Configuring a Dynamic Sitemap on Wagtail

Introduction

A sitemap lists a website’s most important pages, making sure search engines can find and crawl them. It's important to keep your sitemap up to date for optimal SEO. With a quick bit of coding, you can set your sitemap to be created dynamically on demand, ensuring it always reflects the latest content.

Creating a dynamic sitemap for your site is straightforward in Wagtail. A fresh copy is rendered each time it is requested, so it always reflects the current content. After the brief setup, and without additional coding, the sitemap will list all the live Wagtail pages in the default language for your site. For multi-lingual sites, see the final section on how to deal with this.

Creating a Dynamic Sitemap with Wagtail Sitemaps View

In your base.py settings file, add the Wagtail and Django sitemaps apps to INSTALLED_APPS:

'wagtail.contrib.sitemaps',
'django.contrib.sitemaps',

In your site's root urls.py add the following imports:

from django.urls import re_path
from wagtail.contrib.sitemaps.views import sitemap

and in the urlpatterns, above the catch-all:

re_path(r'^sitemap\.xml$', sitemap),

Now, browsing to example.com/sitemap.xml shows something similar to:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-05-17</lastmod>
  </url>
  <url>
    <loc>https://example.com/contact/</loc>
    <lastmod>2021-05-17</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2021-06-02</lastmod>
  </url>
  <url>
    <loc>https://example.com/services/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
  <url>
    <loc>https://example.com/privacy/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
</urlset>

You can see the auto-generated sitemap.xml for this site here.

Adding Support for Routable Pages

If you're using routable pages on your site, you might want to add these as well.

Go to each class with routable pages and override the default get_sitemap_urls method called for each page. Add the following method to the class:

class SomeRoutablePage(RoutablePageMixin, Page):
    ...

    def get_sitemap_urls(self, request):
        sitemap = super().get_sitemap_urls(request)
        sitemap.append(
            {
                "location": self.full_url + self.reverse_subpage('routable_page_name'),
                "lastmod": self.last_published_at or self.latest_revision_created_at,
            }
        )
        return sitemap

Hiding Pages from the Sitemap

If, for some reason, you have a page class that you don’t want to show in the sitemap (any pages that you don’t want indexed, or an empty redirect page), override get_sitemap_urls and return an empty list:

def get_sitemap_urls(self, request):
    return []

Changing the lastmod, changefreq and priority fields

The default method for getting the lastmod value may not be appropriate if the page content is updated by an external source. For a listing page, for example, it may be better to use the date of the latest item in the list.

On the blog listing page, we might want to set the lastmod field to the date of the most recent blog post update.

You could also add items to the dictionary to give the page priority and a changefreq value:

class BlogListingPage(SEOPage):
    ...

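    def get_sitemap_urls(self, request):
        sitemap = super().get_sitemap_urls(request)
        # Most recently revised live, public child page (None if there are no children)
        latest_post = self.get_children().defer_streamfields().live().public().order_by('latest_revision_created_at').last()
        if sitemap:
            # Update the default entry in place: appending a second entry
            # with the same location would list this page twice
            if latest_post:
                sitemap[0]["lastmod"] = latest_post.latest_revision_created_at
            sitemap[0]["changefreq"] = "weekly"
            sitemap[0]["priority"] = 0.3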
        return sitemap

The sitemap entry for the blog listing page then looks like this:

<url>
    <loc>http://example.com/blog/</loc>
    <lastmod>2021-12-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.3</priority>
</url>

Note: It's worth reading through the notes on priority and changefreq on sitemaps.org before using these fields.

Enabling sitemap properties on a per page basis

The above methods for configuring sitemap properties work on a site-wide or per class basis. What if you want to edit properties for particular pages without affecting the entire class?

Here, I'm going to return to the SEOPageMixin model I introduced in an earlier blog post, which all my pages inherit from. Where you add these fields depends entirely on how your site is configured, so adjust to suit.

Add Page class properties

I'll add three fields to my mixin:

  1. a Boolean field to instruct my sitemap view whether to include the page
  2. an optional CharField with choices to indicate the change frequency
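  3. an optional DecimalField to indicate page priority (max_digits=2, decimal_places=1 gives one decimal place; these settings alone still accept values up to 9.9, so the 0.0 to 1.0 range relies on the editor following the help text)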

The panels are added to the end of the Promote tab.
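
Because max_digits and decimal_places constrain precision rather than range, a range check may be worth adding. The following is a minimal plain-Python sketch of such a check. The function name validate_priority is hypothetical; in a Django project the same rule could be expressed on the field with MinValueValidator and MaxValueValidator from django.core.validators.

```python
from decimal import Decimal, InvalidOperation

MIN_PRIORITY = Decimal("0.0")
MAX_PRIORITY = Decimal("1.0")


def validate_priority(value):
    """Return value as a Decimal if it is a valid sitemap priority, else raise ValueError."""
    try:
        priority = Decimal(str(value))
        in_range = MIN_PRIORITY <= priority <= MAX_PRIORITY
    except InvalidOperation:
        raise ValueError(f"{value!r} is not a valid number") from None
    if not in_range:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {priority}")
    return priority


print(validate_priority("0.3"))
try:
    validate_priority("9.9")  # fits max_digits=2, decimal_places=1 but is out of range
except ValueError as error:
    print(error)
```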

Add methods

There are two methods to add at this point:

  1. a lastmod property, which makes overriding at class level easier: you only need to redefine this property rather than the whole get_sitemap_urls method
  2. get_sitemap_urls, which overrides the default method from Wagtail's sitemap as discussed previously and takes the new custom fields into account. changefreq and priority are only added if they are set on the page instance. If the page instance has search_engine_index=False, an empty list is returned so the page is left out of the sitemap.

class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...
    search_engine_index = models.BooleanField(
        blank=False,
        null=False,
        default=True,
        verbose_name=_("Allow search engines to index this page?")
    )

    search_engine_changefreq = models.CharField(
        max_length=25,
        choices=[
            ("always", _("Always")),
            ("hourly", _("Hourly")),
            ("daily", _("Daily")),
            ("weekly", _("Weekly")),
            ("monthly", _("Monthly")),
            ("yearly", _("Yearly")),
            ("never", _("Never")),
        ],
        blank=True,
        null=True,
        verbose_name=_("Search Engine Change Frequency (Optional)"),
        help_text=_("How frequently the page is likely to change? (Leave blank for default)")
    )

    search_engine_priority = models.DecimalField(
        max_digits=2, 
        decimal_places=1,
        blank=True,
        null=True,
        verbose_name=_("Search Engine Priority (Optional)"),
        help_text=_("The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. (Leave blank for default)")
    )
    ....
    promote_panels = [
        ...
        MultiFieldPanel([
            FieldPanel('search_engine_index'),
            FieldPanel('search_engine_changefreq'),
            FieldPanel('search_engine_priority'),
        ], _("Search Engine Indexing")),
    ]
    ....
    @property
    def lastmod(self):
        return self.last_published_at or self.latest_revision_created_at

    def get_sitemap_urls(self, request):
        if not self.search_engine_index:
            # Returning an empty list leaves the page out of the sitemap
            return []
        sitemap = super().get_sitemap_urls(request)
        if sitemap:
            # Update the default entry in place: appending a second entry
            # with the same location would list the page twice
            url_item = sitemap[0]
            url_item["lastmod"] = self.lastmod
            if self.search_engine_changefreq:
                url_item["changefreq"] = self.search_engine_changefreq
            if self.search_engine_priority:
                url_item["priority"] = self.search_engine_priority
        return sitemap

Migrate the changes and you'll see the additional fields on your Promote tab in the Page edit view.

Choose a page, deselect the option "Allow search engines to index this page?" and check your sitemap to verify the page is no longer listed.

Select the option again, choose values for Search Engine Change Frequency and Search Engine Priority, and verify these are shown in the sitemap:

<url>
    <loc>http://localhost/en/tech-blog/</loc>
    <lastmod>2022-12-21</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
</url>

Warning
If you are translating your site with wagtail-localize, any changes made to these custom fields on a page must be synchronised to your translated pages to take effect in those locales.
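
To check the result without scanning the XML by eye, the sitemap can be parsed with the standard library. The sketch below is only illustrative: sitemap_entries is a hypothetical helper, and the sample XML is a cut-down copy of the kind of output shown above rather than a live response.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol for urlset, url, loc and so on
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(xml_text):
    """Map each <loc> in the sitemap to its changefreq and priority (None if absent)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        entries[loc] = {
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", namespaces=SITEMAP_NS),
        }
    return entries


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/en/tech-blog/</loc>
    <lastmod>2022-12-21</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>http://localhost/en/</loc>
    <lastmod>2022-09-04</lastmod>
  </url>
</urlset>"""

entries = sitemap_entries(sample)
for loc, fields in entries.items():
    print(loc, fields)
```

A page that has been hidden should simply be missing from the returned dictionary, and a page with the optional fields set should show their values.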

Configuring Multi-lingual Sites with Custom Sitemap View

Using wagtail-localize, I found that only the default language pages get added to the sitemap. I've added it as a bug/feature request, so I hope to see it in a future release.

To get around this, I dropped the Wagtail sitemap generator and created my own view.

xhtml Links

Google documentation recommends adding a <xhtml:link> entry for each translation of the page entry, including the page itself, with an additional x-default entry for unmatched languages.

<link rel="alternate" href="https://example.com/en-gb" hreflang="en-gb" />
<link rel="alternate" href="https://example.com/en-us" hreflang="en-us" />
<link rel="alternate" href="https://example.com/en-au" hreflang="en-au" />
<link rel="alternate" href="https://example.com/country-selector" hreflang="x-default" />

Warning

Google's documentation specifies the xhtml namespace as

  • xmlns:xhtml="http://www.w3.org/1999/xhtml"

When I used this with xhtml links in my sitemap, the xml didn't appear to be parsed correctly.

I went with the following URI instead, which parsed as expected:

  • xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html"

Bear in mind that http://www.w3.org/1999/xhtml is the standard XHTML namespace and XML namespaces are matched by their exact URI, so if you make this change, check in Google Search Console that your hreflang entries are still recognised.

The Django sitemap.xml template uses the same namespace as the Google documentation.

Customise the Django sitemap template

The default Django template works well for our purpose and saves a lot of recoding, but the schema needs updating to overcome the above error.

Copy the template from django/contrib/sitemaps/templates/sitemap.xml, or copy the code below, to the root of your site template folder (or to your preferred location) and make the update to the xhtml schema. You should have the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset 
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
  xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html"
>
{%spaceless%}
{%for url in urlset%}
  <url>
    <loc>{{url.location}}</loc>
    {%if url.lastmod%}<lastmod>{{url.lastmod|date:"Y-m-d"}}</lastmod>{%endif%}
    {%if url.changefreq%}<changefreq>{{url.changefreq}}</changefreq>{%endif%}
    {%if url.priority%}<priority>{{url.priority}}</priority>{%endif%}
    {%for alternate in url.alternates%}
    <xhtml:link rel="alternate" hreflang="{{alternate.lang_code}}" href="{{alternate.location}}"/>
    {%endfor%}
  </url>
{%endfor%}
{%endspaceless%}
</urlset>

Add a get_alternates() Page method:

Examining the template above, the code looks for an alternates key value in each urlset entry. The value is expected to be an array of dictionaries, similar to the urlset variable. Each alternate value should have lang_code and location key/value pairs to define language and url for each translation to the current page (including a self-reference to the current page).

As well as each translation, we need to add an entry with lang_code='x-default' with location pointing to the default url for that page.

With wagtail-localize, we can use the url of the default language page with the language code stripped out. If my page in the default language was /en/pages/page-one, then the x-default location would be /pages/page-one.
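
The prefix stripping described above can be sketched with the standard library. The helper name strip_language_prefix is hypothetical, and the sketch assumes the language code is always the first path segment.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_language_prefix(url):
    """Remove the first path segment, assumed to be the language code."""
    parts = urlsplit(url)
    # "/en/pages/page-one" -> ["", "en", "pages", "page-one"]
    segments = parts.path.split("/")
    new_path = "/" + "/".join(segments[2:])
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


print(strip_language_prefix("https://example.com/en/pages/page-one"))
print(strip_language_prefix("http://localhost:8000/en/"))
```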

To build the alternates value for each page, I'll head back to my SEOPageMixin (see above) and add a further method:

from wagtail.models import Locale

class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...
    def get_alternates(self):
        default_locale = Locale.get_default()
        x_default = None

        trans_pages = self.get_translations(inclusive=True)
        if trans_pages.count() > 1:
            alt = []
            for page in trans_pages:
                alt.append({
                    'lang_code': page.locale.language_code,
                    'location': page.get_full_url()
                })
                if page.locale == default_locale:
                    x_default = page.get_url_parts()
            
            if not x_default: # page not translated to default language, use first trans_page instead
                x_default = trans_pages.first().get_url_parts()

            # x-default - strip the language component from the url for the default-lang page
            # https://example.com/en/something/ -> https://example.com/something/
            x_default = f"{x_default[1]}/{'/'.join(x_default[2].split('/')[2:])}"
            alt.append({'lang_code': 'x-default', 'location': x_default})

            return alt
        else:
            return None

Note
When a page has been translated, but not into the default language, the x-default page is set to the first in the trans_page queryset. It's not strictly correct, but the url will work regardless, so long as it is consistent across all the translated pages for this page.

Now, the get_alternates() method for my default homepage returns the following value:

[
  {'lang_code': 'en', 'location': 'http://localhost:8000/en/'}, 
  {'lang_code': 'es', 'location': 'http://localhost:8000/es/'}, 
  {'lang_code': 'x-default', 'location': 'http://localhost:8000/'}
]

Amend get_sitemap_urls()

The get_sitemap_urls method we defined in the previous section needs a few changes:

  • We need to add the alternates value to the urlset item for that page.
  • Since we're using a custom view, we don't need the call to super().get_sitemap_urls(request).
  • We'll just return the url_item dictionary rather than the entire sitemap. We'll append the url_item in the view instead.

class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...
    def get_sitemap_urls(self, request):
        if self.search_engine_index:
            url_item = {
                "location": self.full_url,
                "lastmod": self.lastmod,
                "alternates": self.get_alternates()
            }
            if self.search_engine_changefreq:
                url_item["changefreq"] = self.search_engine_changefreq
            if self.search_engine_priority:
                url_item["priority"] = self.search_engine_priority
            
            return url_item
        else:
            return []

Create a Custom Sitemap view

To tie this all together, we need a custom view that finds the homepage for the requested site, gets all translations of that page and iterates through all the children of each.

From this, we render a TemplateResponse using the customised template and the urlset. To this, we add an X-Robots-Tag header entry to tell search engines not to index or archive this page. We also calculate the most recent update from the listed pages and use this as the last-modified header value.
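
The Last-Modified calculation described above can be sketched in plain Python. This is only an illustration with made-up dates: email.utils.formatdate from the standard library stands in for Django's http_date, and the urlset entries are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import formatdate

# Hypothetical urlset entries; in the view these come from get_sitemap_urls()
urlset = [
    {"location": "https://example.com/en/", "lastmod": datetime(2022, 9, 4, tzinfo=timezone.utc)},
    {"location": "https://example.com/en/tech-blog/", "lastmod": datetime(2022, 12, 21, tzinfo=timezone.utc)},
]

# The most recent lastmod across all listed pages
last_modified = max(item["lastmod"] for item in urlset)

# Format the timestamp as an HTTP date for the Last-Modified header
header_value = formatdate(last_modified.timestamp(), usegmt=True)
print(header_value)
```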

Note: get_sitemap_urls returns an empty list for pages that are excluded from the sitemap, and more than one page may be excluded. urlset.remove([]) would only remove the first empty list (and raises ValueError when there is none), so the view filters out every empty result with a list comprehension instead.

I have a core app on my sites where I keep such code; adjust to suit your own site:

# core.views.py

from django.template.response import TemplateResponse
from django.utils.http import http_date
from wagtail.models import Page, Site

def sitemap(request):
    site = Site.find_for_request(request)
    root_page = Page.objects.defer_streamfields().get(id=site.root_page_id)

    urlset = []
    for locale_home in root_page.get_translations(inclusive=True).defer_streamfields().live().public().specific():
        urlset.append(locale_home.get_sitemap_urls(request))
        for child_page in locale_home.get_descendants().defer_streamfields().live().public().specific():
            urlset.append(child_page.get_sitemap_urls(request))
    # Drop the empty lists returned by pages that are hidden from the sitemap
    urlset = [item for item in urlset if item]
    # Most recent lastmod across the listed pages (assumes at least one page is listed)
    last_modified = max(item['lastmod'] for item in urlset)

    return TemplateResponse(
        request, 
        template='sitemap.xml', 
        context={'urlset': urlset},
        content_type='application/xml',
        headers={
            "X-Robots-Tag": "noindex, noodp, noarchive", 
            "last-modified": http_date(last_modified.timestamp()),
            "vary": "Accept-Encoding",
            }
        )

defer_streamfields()
Apply to a queryset to prevent fetching/decoding of StreamField values on evaluation. Useful when working with potentially large numbers of results, where StreamField values are unlikely to be needed. For example, when generating a sitemap or a long list of page links.
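
On the cleanup of empty results in the view above: Python's list.remove deletes only the first matching element and raises ValueError when there is no match, so it is worth checking how it behaves when several pages are hidden. A small sketch with made-up data:

```python
# Hypothetical results: two pages returned [] because they are hidden
results = [{"location": "/a/"}, [], {"location": "/b/"}, []]

after_remove = list(results)
after_remove.remove([])  # removes the first [] only
print(after_remove)

# A list comprehension drops every empty result and never raises
filtered = [item for item in results if item]
print(filtered)
```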

Amend the sitemap urls.py entry

In your site urls.py, remove the import for the Wagtail sitemap view (if you were using it), import the one you just created and add:

from core.views import sitemap
from django.urls import re_path
...
urlpatterns = [
    ...
    re_path(r'^sitemap\.xml$', sitemap, name='sitemap'),
    ...
]

You can also remove Wagtail and Django sitemaps from your INSTALLED_APPS list now.

Example Output

I'll use a test site that has English and Spanish enabled with home page, blog index and three blog pages. All pages except the third blog page have been translated.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html">
    <url>
        <loc>http://localhost:8000/en/</loc>
        <lastmod>2022-09-04</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/first-tech-blog/</loc>
        <lastmod>2022-09-26</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/first-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/second-tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <changefreq>weekly</changefreq>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/second-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/third-tech-blog/</loc>
        <lastmod>2022-12-16</lastmod>
    </url>
    <url>
        <loc>http://localhost:8000/es/</loc>
        <lastmod>2022-08-16</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/primer-tech-blog/</loc>
        <lastmod>2022-09-06</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/first-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/segundo-tech-blog/</loc>
        <lastmod>2022-08-18</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/tech-blog/second-tech-blog/" />
    </url>
</urlset>

Notify Google of Sitemap Updates

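Note that Google announced in 2023 that it was retiring its sitemap ping endpoint, so check that it is still available before relying on this section. To submit your sitemap to Google programmatically, you can send a GET request to the ping tool in the format: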

  • https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP

You can do this on page publish or delete, as a cron job that runs periodically, or manually from a button added to your admin site.

I've only used Google in the example below. For a more thorough approach, add other search engines to a list and loop through it.
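
Looping over several search engines could look something like the sketch below. Only the Google endpoint comes from this article; the second endpoint is a placeholder, and build_ping_urls is a hypothetical helper that only builds the URLs without sending any request.

```python
from urllib.parse import urlencode

PING_ENDPOINTS = [
    "https://www.google.com/ping",
    "https://search.example/ping",  # placeholder, not a real service
]


def build_ping_urls(sitemap_url, endpoints=PING_ENDPOINTS):
    """Return one ping URL per endpoint, with the sitemap URL safely encoded."""
    params = urlencode({"sitemap": sitemap_url})
    return [f"{endpoint}?{params}" for endpoint in endpoints]


for ping_url in build_ping_urls("https://example.com/sitemap.xml"):
    print(ping_url)
```

Each URL in the returned list could then be passed to urlopen inside the same try/except used for the single-endpoint version.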

First, I create the method to ping Google with the sitemap URL, saved to utils.py in my core app:

# core.utils.py

from urllib.parse import urlencode
from urllib.request import urlopen
from django.urls import reverse

PING_URL = "https://www.google.com/webmasters/tools/ping"

def ping_google(request, ping_url=PING_URL):
    try:
        sitemap = request.build_absolute_uri(reverse('sitemap'))
        params = urlencode({"sitemap": sitemap})
        urlopen(f"{ping_url}?{params}")
    except Exception as e:
        print(f"{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}")

Note
reverse('sitemap') assumes I have named my sitemap view in urls.py with 'sitemap':
re_path(r'^sitemap\.xml$', sitemap, name='sitemap')
Adjust for your site if necessary.

For a Wagtail site publishing occasionally, I'll ping Google each time a page is updated, published or deleted using hooks. Be sure to add the test for debug to limit this to your production site:

# wagtail_hooks.py

from django.conf import settings
from wagtail import hooks
from .utils import ping_google

@hooks.register('after_delete_page')
def do_after_delete_page(request, page):
    if not settings.DEBUG:
        ping_google(request)

@hooks.register("after_publish_page")
def do_after_publish(request, page):
    if not settings.DEBUG:
        ping_google(request)

Conclusion

On this page, I covered creating dynamic sitemaps using the built-in Wagtail sitemap app, and by creating a custom view.

I also covered:

  • How to add entries for routable pages.
  • How to hide pages from the sitemap.
  • How to alter the lastmod field, and add values for the changefreq and priority fields at either a site-wide or class level.
  • How to add custom fields to a mixin to allow these values to be changed on a per page basis.
  • How to build a sitemap for multi-lingual sites with a custom view.
  • How to notify Google of sitemap updates.

  Please feel free to leave any questions or comments below, or send me a message here