Configuring a Dynamic Sitemap on Wagtail

Introduction

A sitemap lists a website’s most important pages, making sure search engines can find and crawl them. It's important to keep your sitemap up to date for optimal SEO. With a quick bit of coding, you can set your sitemap to be created dynamically on demand, ensuring it always reflects the latest content.

Creating a dynamic sitemap for your site is straightforward in Wagtail. A fresh copy is rendered each time it is requested, so it always reflects the current content. After the brief setup, and without additional coding, it will list all the live, public Wagtail pages in the default language for your site. For multi-lingual sites, see the section on configuring a custom sitemap view further down.

Creating a Dynamic Sitemap with Wagtail Sitemaps View

In your settings file (base.py), add the Wagtail and Django sitemap apps to INSTALLED_APPS:

'wagtail.contrib.sitemaps',
'django.contrib.sitemaps',

In your site's root urls.py add the following imports:

from django.urls import re_path
from wagtail.contrib.sitemaps.views import sitemap

and in the urlpatterns, above the catch-all:

re_path(r'^sitemap\.xml$', sitemap),

Now, browsing to example.com/sitemap.xml shows something similar to:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-05-17</lastmod>
  </url>
  <url>
    <loc>https://example.com/contact/</loc>
    <lastmod>2021-05-17</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2021-06-02</lastmod>
  </url>
  <url>
    <loc>https://example.com/services/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
  <url>
    <loc>https://example.com/privacy/</loc>
    <lastmod>2021-05-19</lastmod>
  </url>
</urlset>

You can see the auto-generated sitemap.xml for this site here.
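
If you'd rather check the output from a script than from the browser, here is a minimal sketch using only Python's standard library. The sample string stands in for the body of a response from /sitemap.xml, and the helper name list_sitemap_entries is my own, not part of Wagtail or Django.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample data standing in for the body of a response from /sitemap.xml
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-05-17</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2021-06-02</lastmod>
  </url>
</urlset>"""


def list_sitemap_entries(xml_text):
    """Return (loc, lastmod) tuples for every <url> element; lastmod may be None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


for loc, lastmod in list_sitemap_entries(SAMPLE_XML):
    print(loc, lastmod)
```

In practice you would fetch the XML from your own site and pass the response text to the helper.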

Adding Support for Routable Pages

If you're using routable pages on your site, you might want to add these as well.

Go to each class with routable pages and override the default get_sitemap_urls method called for each page. Add the following method to the class:

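from wagtail.contrib.routable_page.models import RoutablePageMixin
from wagtail.models import Page

class SomeRoutablePage(RoutablePageMixin, Page):
    ....

    def get_sitemap_urls(self, request=None):
        # Start from the default entry for the page itself
        sitemap = super().get_sitemap_urls(request)
        # Add an extra entry for the named sub-route
        sitemap.append(
            {
                "location": self.full_url + self.reverse_subpage('routable_page_name'),
                "lastmod": self.last_published_at or self.latest_revision_created_at,
            }
        )
        return sitemap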

Hiding Pages from the Sitemap

If, for some reason, you have a page class that you don’t want to show in the sitemap (any pages that you don’t want indexed, or an empty redirect page), override get_sitemap_urls and return an empty list:

def get_sitemap_urls(self, request=None):
    return []
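
To see why an empty list hides a page and an appended dictionary adds an extra URL, here is a simplified sketch of how the per-page lists get flattened into one sitemap. The classes below are plain-Python stand-ins I've made up for illustration; they are not Wagtail's own classes or its generator code.

```python
class FakePage:
    """Stand-in for a Wagtail page; only the sitemap hook is modelled."""

    def __init__(self, url):
        self.url = url

    def get_sitemap_urls(self, request=None):
        return [{"location": self.url}]


class HiddenPage(FakePage):
    def get_sitemap_urls(self, request=None):
        # An empty list contributes nothing to the sitemap
        return []


class RoutableFakePage(FakePage):
    def get_sitemap_urls(self, request=None):
        urls = super().get_sitemap_urls(request)
        # Extra entries are appended after the page's own entry
        urls.append({"location": self.url + "archive/"})
        return urls


def collect_urls(pages):
    """Flatten the per-page lists into one list of url entries."""
    collected = []
    for page in pages:
        collected.extend(page.get_sitemap_urls())
    return collected


pages = [
    FakePage("https://example.com/"),
    HiddenPage("https://example.com/thanks/"),
    RoutableFakePage("https://example.com/blog/"),
]
for entry in collect_urls(pages):
    print(entry["location"])
```

The hidden page never appears in the output, while the routable stand-in contributes two entries.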

Changing the lastmod, changefreq and priority fields

The default lastmod value may not be appropriate if the page content is updated by an external source, or if the page is a listing whose content changes whenever one of its items does. In those cases, it may be better to use the date of the latest item in the list.

On the blog listing page, we might want to set the lastmod field to the date of the most recent blog post update.

You could also add items to the dictionary to give the page priority and changefreq values:

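class BlogListingPage(SEOPage):
    ...

    def get_sitemap_urls(self, request=None):
        sitemap = super().get_sitemap_urls(request)
        # Most recently published live, public child page (None if there are none)
        latest_post = (
            self.get_children()
            .defer_streamfields()
            .live()
            .public()
            .order_by('last_published_at')
            .last()
        )
        # super() already returns an entry for this page, so update it in
        # place; appending a second dictionary would list the URL twice
        if sitemap:
            if latest_post:
                sitemap[0]["lastmod"] = latest_post.last_published_at
            sitemap[0]["changefreq"] = "weekly"
            sitemap[0]["priority"] = 0.3
        return sitemap

The resulting entry: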
<url>
    <loc>http://example.com/blog/</loc>
    <lastmod>2021-12-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.3</priority>
</url>
Attention

It's worth reading through the notes on priority and changefreq on sitemaps.org before using these fields. It's also worth noting that Google does not take these fields into consideration when indexing a site.

Enabling sitemap properties on a per page basis

The above methods for configuring sitemap properties work on a site-wide or per class basis. What if you want to edit properties for particular pages without affecting the entire class?

Here, I'm going to return to the SEOPageMixin model I introduced in an earlier blog post, which all my pages inherit from. Where you add these fields depends entirely on how your site is configured, so adjust as suits you.

Add Page class properties

I'll add three fields to my mixin:

  1. a Boolean field search_engine_index to instruct my sitemap view whether to include the page
  2. an optional CharField with choices to indicate the change frequency
  3. an optional DecimalField to indicate page priority (max_digits=2, decimal_places=1, which allows one decimal place; on its own this accepts values up to 9.9, so add validators if you need to enforce the 0.0 to 1.0 range)

The panels are added to the end of the Promote tab.
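
Since max_digits=2 with decimal_places=1 still accepts a value such as 9.9, it can be worth checking the range explicitly. Below is a minimal plain-Python sketch of the checks involved; the function names are my own. In a Django model you would more likely attach MinValueValidator and MaxValueValidator from django.core.validators to the field.

```python
from decimal import Decimal

# changefreq keywords defined by the sitemaps.org protocol
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def is_valid_priority(value):
    """True if value is None (unset) or lies within 0.0 to 1.0 inclusive."""
    if value is None:
        return True
    return Decimal("0.0") <= Decimal(str(value)) <= Decimal("1.0")


def is_valid_changefreq(value):
    """True if value is unset or one of the protocol's allowed keywords."""
    return value in (None, "") or value in VALID_CHANGEFREQ


print(is_valid_priority(Decimal("0.9")))    # within the allowed range
print(is_valid_priority(Decimal("9.9")))    # fits the field definition but is out of range
print(is_valid_changefreq("weekly"))
print(is_valid_changefreq("fortnightly"))   # not a protocol keyword
```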

Note

Remember, excluding a page from the sitemap does not prevent a search engine from indexing it, since the site will also be crawled from internal and external links. You should, at the very least, conditionally add <meta name="robots" content="noindex"> to a page's <head> where search_engine_index=False, or use some other means of instructing the search engine not to index that page.

Add methods

There are two methods to add at this point:

  1. The lastmod property is here to make overriding this at class level easier: you only need to redefine this property rather than the whole get_sitemap_urls method.
  2. get_sitemap_urls overrides the default method from Wagtail's sitemap as discussed previously and takes the new custom fields into account. changefreq and priority are only added if they are set on the page instance. If the page instance has search_engine_index=False, an empty list is returned to skip adding it to the sitemap.
core/models.py
class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ....
    search_engine_index = models.BooleanField(
        blank=False,
        null=False,
        default=True,
        verbose_name=_("Allow search engines to index this page?")
    )

    search_engine_changefreq = models.CharField(
        max_length=25,
        choices=[
            ("always", _("Always")),
            ("hourly", _("Hourly")),
            ("daily", _("Daily")),
            ("weekly", _("Weekly")),
            ("monthly", _("Monthly")),
            ("yearly", _("Yearly")),
            ("never", _("Never")),
        ],
        blank=True,
        null=True,
        verbose_name=_("Search Engine Change Frequency (Optional)"),
        help_text=_("How frequently the page is likely to change? (Leave blank for default)")
    )

    search_engine_priority = models.DecimalField(
        max_digits=2, 
        decimal_places=1,
        blank=True,
        null=True,
        verbose_name=_("Search Engine Priority (Optional)"),
        help_text=_("The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. (Leave blank for default)")
    )
    ....
    promote_panels = [
        ...
        MultiFieldPanel([
            FieldPanel('search_engine_index'),
            FieldPanel('search_engine_changefreq'),
            FieldPanel('search_engine_priority'),
        ], _("Search Engine Indexing")),
    ]
    ....
    @property
    def lastmod(self):
        return self.last_published_at or self.latest_revision_created_at

    def get_sitemap_urls(self, request=None):
        if not self.search_engine_index:
            # excluded pages contribute no entries
            return []
        sitemap = super().get_sitemap_urls(request)
        # super() already returns an entry for this page: update it in place
        # rather than appending a duplicate entry for the same URL
        url_item = sitemap[0]
        url_item["lastmod"] = self.lastmod
        if self.search_engine_changefreq:
            url_item["changefreq"] = self.search_engine_changefreq
        if self.search_engine_priority:
            url_item["priority"] = self.search_engine_priority
        return sitemap

Migrate the changes and you'll see the additional fields on your Promote tab in the Page edit view.

Choose a page and deselect the option Allow search engines to index this page? and check your sitemap to verify the page is no longer listed.

Set the option to true again and choose values for Search Engine Change Frequency and Search Engine Priority and verify these are shown in the sitemap.

<url>
    <loc>http://localhost/en/tech-blog/</loc>
    <lastmod>2022-12-21</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
</url>
Warning

If you are translating your site with wagtail-localize, any changes made to these custom fields on a page must be synchronised to your translated pages to take effect in those locales.

Configuring Multi-lingual Sites with Custom Sitemap View

Using wagtail-localize, I found that only the default language pages get added to the sitemap. I've added it as a bug/feature request, so I hope to see it in a future release.

To get around this, I dropped the Wagtail sitemap generator and created my own view.

xhtml Links

Google documentation recommends adding an <xhtml:link> entry for each translation of the page entry, including the page itself, with an additional x-default entry for unmatched languages.

<xhtml:link rel="alternate" href="https://example.com/en-gb" hreflang="en-gb" />
<xhtml:link rel="alternate" href="https://example.com/en-us" hreflang="en-us" />
<xhtml:link rel="alternate" href="https://example.com/en-au" hreflang="en-au" />
<xhtml:link rel="alternate" href="https://example.com/country-selector" hreflang="x-default" />
Warning

Google requires the xhtml namespace to be specified as:

xmlns:xhtml="http://www.w3.org/1999/xhtml"

If you use this with xhtml links in your sitemap, you'll see that the xml isn't parsed correctly in your browser. Your alternates entries won't be visible and your <loc> entries will be displayed as one long string. While it looks like an error, this is the namespace required by Google.

You can use the following to view your sitemap in your browser; however, Google will report an error that the xmlns:xhtml namespace hasn't been declared if you submit this. Use this for testing only:

xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html"

Alternatively, just view the page source for the sitemap.xml in your browser.
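
Another way to confirm the alternates are really there, despite what the browser shows, is to parse the XML with the required namespace. This is a minimal sketch using Python's standard library; the sample string stands in for your own sitemap and the helper name alternates_by_loc is my own.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; the URIs are the ones the sitemap declares
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Sample data standing in for a multi-lingual sitemap
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/"/>
    <xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/"/>
  </url>
</urlset>"""


def alternates_by_loc(xml_text):
    """Map each <loc> to a dict of hreflang -> href from its xhtml:link children."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
        }
    return result


print(alternates_by_loc(SAMPLE_XML))
```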

Customise the Django sitemap template

The default Django template works well for our purpose and saves a lot of recoding.

Since we are not using the Django or Wagtail sitemaps app here, copy the template from django/contrib/sitemaps/templates/sitemap.xml, or copy the code below, to the root of your site template folder (or to your preferred location). You should have the following:

templates/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset 
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
{% spaceless %}
{% for url in urlset %}
  <url>
    <loc>{{ url.location }}</loc>
    {% if url.lastmod %}<lastmod>{{ url.lastmod|date:"Y-m-d" }}</lastmod>{% endif %}
    {% if url.changefreq %}<changefreq>{{ url.changefreq }}</changefreq>{% endif %}
    {% if url.priority %}<priority>{{ url.priority }}</priority>{% endif %}
    {% for alternate in url.alternates %}
    <xhtml:link rel="alternate" hreflang="{{ alternate.lang_code }}" href="{{ alternate.location }}"/>
    {% endfor %}
  </url>
{% endfor %}
{% endspaceless %}
</urlset>

Add an alternates Page property

Examining the template above, the code looks for an alternates key value in each urlset entry. The value is expected to be an array of dictionaries, similar to the urlset variable. Each alternate value should have lang_code and location key/value pairs to define language and url for each translation to the current page (including a self-reference to the current page).

As well as each translation, we need to add an entry with lang_code='x-default' with location pointing to the default url for that page.

With wagtail-localize, we can use the url of the default language page. If the page doesn't exist in the default language, we can use the first element of the translated pages queryset which will be the source page for those translations.

To build the alternates value for each page, I'll head back to my SEOPageMixin (see above) and add two further properties:

core/models.py
from django.utils.functional import cached_property
from wagtail.models import Locale
....

class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...
    @cached_property
    def translations(self):
        """
        Return dict of lang-code/url key/value pairs for each page that has a live translation including self
        Urls are relative.
        """
        return {
            page.locale.language_code: page.url
            for page in self.get_translations(inclusive=True)
            .live()
            .defer_streamfields()
        }

    @cached_property
    def alternates(self):
        """
        Create list of translations for <head> entries.
        Convert translations dict into list of dictionaries with lang_code and location keys for each translations item.
        Convert translations urls to absolute urls instead of relative urls
        Add x-default value.
        """
        default_lang_code = Locale.get_default().language_code
        site_root = self.get_site().root_url
        alt = [
            {"lang_code": key, "location": f"{site_root}{value}"}
            for key, value in self.translations.items()
        ]
        x_default = self.translations.get(default_lang_code)
        if not x_default:
            # doesn't exist in default locale, use the first locale in the translations
            try:
                x_default = list(self.translations.items())[0][1]
            except IndexError:  # translations empty before original page first published
                x_default = self.url
        alt.append({"lang_code": "x-default", "location": f"{site_root}{x_default}"})
        return alt
Note

I've split the property into two so that each part can also serve another purpose.

  • Translations: This will be used to build a language switcher on page load, best with relative urls. The language switcher will be covered in a later article. The property returns a dictionary similar to the following:
{'en': '/en/products/',
 'de': '/de/produkte/',
 'es': '/es/productos/',
 'fr': '/fr/produits/'}
  • Alternates: Aside from the sitemap, this can also be used to build the <link rel="alternate" ...> elements for the page <head>. This property is built from the translations dictionary without calling get_translations() a second time. These should be absolute urls, so the root url for the site that the page belongs to is prepended when building the list of values. Those links can now be added to the page template with:
{% for alt in self.alternates %}
    <link rel="alternate" hreflang="{{ alt.lang_code }}" href="{{ alt.location }}">
{% endfor %}
This renders as:
<link rel="alternate" hreflang="en" href="http://localhost:8000/en/products/">
<link rel="alternate" hreflang="de" href="http://localhost:8000/de/produkte/">
<link rel="alternate" hreflang="es" href="http://localhost:8000/es/productos/">
<link rel="alternate" hreflang="fr" href="http://localhost:8000/fr/produits/">
<link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/products/">

Using @cached_property means these values are calculated only once per page instance, which in practice means once per page load.

When a page has been translated, but not into the default language, the x-default page is set to the first in the translations items. It's not strictly correct, but the url will work regardless, so long as it is consistent across all the translated pages for this page.

Now, the alternates property for my default homepage returns the following value:

[
  {'lang_code': 'en', 'location': 'http://localhost:8000/en/'}, 
  {'lang_code': 'es', 'location': 'http://localhost:8000/es/'}, 
  {'lang_code': 'x-default', 'location': 'http://localhost:8000/en/'}
]
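
The x-default fallback described above can be tried out without a database. The following is a plain-Python sketch of the same logic; build_alternates and its parameters are names I've made up for illustration, with the translations dictionary, site root and default language code passed in directly instead of being read from the page.

```python
def build_alternates(translations, site_root, default_lang_code, fallback_url):
    """
    translations: dict of language code -> relative URL (insertion-ordered)
    fallback_url: relative URL to use when translations is empty
    """
    alternates = [
        {"lang_code": code, "location": f"{site_root}{url}"}
        for code, url in translations.items()
    ]
    x_default = translations.get(default_lang_code)
    if not x_default:
        # No page in the default locale: use the first translation, if any
        x_default = next(iter(translations.values()), fallback_url)
    alternates.append({"lang_code": "x-default", "location": f"{site_root}{x_default}"})
    return alternates


# A page translated into German and Spanish but not into the default English
for item in build_alternates(
    {"de": "/de/produkte/", "es": "/es/productos/"},
    site_root="https://example.com",
    default_lang_code="en",
    fallback_url="/de/produkte/",
):
    print(item)
```

Here the x-default entry points at the German page, because it is the first item in the translations dictionary.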

Amend get_sitemap_urls()

To the get_sitemap_urls method we defined in the previous section:

  • We need to add the alternates value to the urlset item for that page.
  • Since we're using a custom view, we don't need the call to super().get_sitemap_urls().
  • We'll just return the url_item dictionary rather than the entire sitemap. We'll append the url_item in the view instead.
core/models.py
class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...
    def get_sitemap_urls(self):
        if self.search_engine_index:
            url_item = {
                "location": self.full_url,
                "lastmod": self.lastmod,
                "alternates": self.alternates
            }
            if self.search_engine_changefreq:
                url_item["changefreq"] = self.search_engine_changefreq
            if self.search_engine_priority:
                url_item["priority"] = self.search_engine_priority
            
            return url_item
        else:
            return []

Create a Custom Sitemap view

To tie this all together, we need a custom view that finds the homepage for the requested site, gets all translations of that page and iterates through all the children of each.

From this, we render a TemplateResponse using the customised template and the urlset. To this, we add an X-Robots-Tag header entry to tell search engines not to index or archive this page. We also calculate the most recent update from the listed pages and use this as the last-modified header value.

Any empty list returned from the get_sitemap_urls methods is stripped from urlset before rendering.

I have a core app on my sites where I keep such code, adjust to your own site:

core/views.py
from datetime import datetime

from django.template.response import TemplateResponse
from django.utils.http import http_date
from wagtail.models import Page, Site

def sitemap(request):
    site = Site.find_for_request(request)
    urlset = []
    # Find root page for the site and translations of root page
    for locale_home in (
        site.root_page.get_translations(inclusive=True)
        .live()
        .defer_streamfields()
        .specific()
    ):
        # For every page in each locale tree, add url set entry
        for page in (
            locale_home.get_descendants(inclusive=True)
            .live()
            .defer_streamfields()
            .specific()
        ):
            if page.search_engine_index:
                urlset.append(page.get_sitemap_urls())

    # strip any empty entries (list.remove() would only drop the first one)
    urlset = [item for item in urlset if item]

    try:
        # get the last_modified value for the sitemap header
        last_modified = max(x['lastmod'] for x in urlset)
    except (ValueError, KeyError, TypeError) as e:
        # either urlset is empty or lastmod fields not present, set last modified to now
        print(f"\n{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}\n")
        last_modified = datetime.now()

    return TemplateResponse(
        request, 
        template='sitemap.xml', 
        context={'urlset': urlset},
        content_type='application/xml',
        headers={
            "X-Robots-Tag": "noindex, noodp, noarchive", 
            "last-modified": http_date(last_modified.timestamp()),
            "vary": "Accept-Encoding",
            }
        )
defer_streamfields()
Apply to a queryset to prevent fetching/decoding of StreamField values on evaluation. Useful when working with potentially large numbers of results, where StreamField values are unlikely to be needed. For example, when generating a sitemap or a long list of page links.
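
On stripping empty entries and working out the last-modified value in the view: the sketch below uses made-up sample data to show how list.remove() behaves compared with a list comprehension, and how max() can be given a default for an empty list. It uses email.utils.format_datetime from the standard library as a stand-in for Django's http_date so that it runs without Django installed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Made-up entries, including two empty lists from pages excluded from the sitemap
urlset = [
    {"location": "https://example.com/en/", "lastmod": datetime(2022, 9, 4, tzinfo=timezone.utc)},
    [],
    {"location": "https://example.com/es/", "lastmod": datetime(2022, 12, 21, tzinfo=timezone.utc)},
    [],
]

# list.remove() deletes only the first matching element
once = list(urlset)
once.remove([])
print(once.count([]))  # number of empty lists still present

# A comprehension drops every empty entry and never raises
cleaned = [item for item in urlset if item]

# default= avoids a ValueError when there are no entries at all
last_modified = max(
    (item["lastmod"] for item in cleaned),
    default=datetime.now(timezone.utc),
)
print(format_datetime(last_modified, usegmt=True))
```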

Amend the sitemap urls.py entry

In your site urls.py, remove the import for the Wagtail sitemap view (if you were using it), import the one you just created and add:

urls.py
from core.views import sitemap
from django.urls import re_path
...
urlpatterns = [
    ...
    re_path(r'^sitemap\.xml$', sitemap, name='sitemap'),
    ...
]

You can also remove Wagtail and Django sitemaps from your INSTALLED_APPS list now.

Example Output

I'll use a test site that has English and Spanish enabled with home page, blog index and three blog pages. All pages except the third blog page have been translated.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>http://localhost:8000/en/</loc>
        <lastmod>2022-09-04</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/first-tech-blog/</loc>
        <lastmod>2022-09-26</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/second-tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <changefreq>weekly</changefreq>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/en/tech-blog/third-tech-blog/</loc>
        <lastmod>2022-12-16</lastmod>
    </url>
    <url>
        <loc>http://localhost:8000/es/</loc>
        <lastmod>2022-08-16</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/</loc>
        <lastmod>2022-12-21</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/primer-tech-blog/</loc>
        <lastmod>2022-09-06</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
    </url>
    <url>
        <loc>http://localhost:8000/es/tech-blog/segundo-tech-blog/</loc>
        <lastmod>2022-08-18</lastmod>
        <xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
        <xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
    </url>
</urlset>

Notify Google of Sitemap Updates

Warning

Google has announced that it will be deprecating the ping API at the end of 2023. Other search engines may continue their ping APIs; adjust for any that you might want to use.

To submit your sitemap to Google programmatically, you can send a GET request to the ping tool, passing the full URL of your sitemap as the sitemap query parameter.

You can do this on page publish or delete, as a cron job that runs periodically, or manually from a button in your admin site.

I've used only Google in the example below; for a more thorough approach, add other search engines as a list and loop through it.

First, I create the method to ping Google with the sitemap URL, saved to utils.py in my core app:

core/utils.py
from urllib.parse import urlencode
from urllib.request import urlopen
from django.urls import reverse

PING_URL = "https://www.google.com/webmasters/tools/ping"

def ping_google(request, ping_url=PING_URL):
    try:
        sitemap = request.build_absolute_uri(reverse('sitemap'))
        params = urlencode({"sitemap": sitemap})
        urlopen(f"{ping_url}?{params}")
    except Exception as e:
        print(f"{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}")
Note

reverse('sitemap') assumes I have named my sitemap view in urls.py with 'sitemap':

re_path(r'^sitemap\.xml$', sitemap, name='sitemap')

Adjust for your site if necessary.
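
To cover more than one search engine, the ping URLs can be built in a loop. This is a rough sketch that only constructs the URLs and sends nothing; build_ping_urls is a name I've made up, and the second endpoint is a placeholder rather than a real service.

```python
from urllib.parse import urlencode

# Google's endpoint is the one used in this article; the second entry is a
# made-up placeholder to show the loop, not a real service.
PING_ENDPOINTS = [
    "https://www.google.com/webmasters/tools/ping",
    "https://search.example.com/ping",
]


def build_ping_urls(sitemap_url, endpoints=PING_ENDPOINTS):
    """Return one fully encoded ping URL per endpoint, without sending anything."""
    params = urlencode({"sitemap": sitemap_url})
    return [f"{endpoint}?{params}" for endpoint in endpoints]


for url in build_ping_urls("https://example.com/sitemap.xml"):
    print(url)
```

Each resulting URL could then be passed to urlopen inside the same try/except used in ping_google.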

For a Wagtail site publishing occasionally, I'll use hooks to ping Google each time a page is published or deleted. Be sure to include the settings.DEBUG test to limit this to your production site:

wagtail_hooks.py
from django.conf import settings
from wagtail import hooks
from .utils import ping_google

@hooks.register('after_delete_page')
def do_after_delete_page(request, page):
    if not settings.DEBUG:
        ping_google(request)

@hooks.register("after_publish_page")
def do_after_publish(request, page):
    if not settings.DEBUG:
        ping_google(request)

Conclusion

In this guide, we explored the intricacies of dynamic sitemap creation in Wagtail, leveraging both the built-in sitemap app and custom views. Key highlights include:

  1. Incorporating entries for routable pages to enhance sitemap comprehensiveness.
  2. Effectively hiding specific pages from the sitemap for tailored SEO strategies.
  3. Fine-tuning the lastmod field and introducing values for changefreq and priority fields at both site-wide and class levels.
  4. Empowering custom fields within mixins, enabling per-page adjustments to enhance flexibility.
  5. Crafting sitemaps for multilingual sites via a bespoke view, ensuring inclusivity.
  6. Proactively notifying Google of sitemap updates for optimal search engine visibility.

By mastering these techniques, you're equipped to sculpt precise and dynamic sitemaps tailored to your Wagtail projects, optimizing search engine interactions and bolstering your site's overall SEO performance.

