Configuring a Dynamic Sitemap on Wagtail
Introduction
A sitemap lists a website’s most important pages, making sure search engines can find and crawl them. It's important to keep your sitemap up to date for optimal SEO. With a quick bit of coding, you can set your sitemap to be created dynamically on demand, ensuring it always reflects the latest content.
Creating a dynamic sitemap for your site is straightforward in Wagtail. A fresh copy is rendered each time it is requested, so it always reflects the current content. After the brief setup, and without additional coding, it will list all the live Wagtail pages in the default language for your site. For multi-lingual sites, see the section on custom sitemap views further down.
Creating a Dynamic Sitemap with Wagtail Sitemaps View
In your base.py settings file, add the Django and Wagtail sitemaps apps to INSTALLED_APPS:
'wagtail.contrib.sitemaps',
'django.contrib.sitemaps',
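In your site's root urls.py, add the following imports:
from django.urls import re_path
from wagtail.contrib.sitemaps.views import sitemap
and in the urlpatterns, above the Wagtail catch-all (re_path is used here because the older url() helper was removed in Django 4.0):
re_path(r'^sitemap\.xml$', sitemap),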
Now, browsing to example.com/sitemap.xml shows something similar to:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2021-05-17</lastmod>
</url>
<url>
<loc>https://example.com/contact/</loc>
<lastmod>2021-05-17</lastmod>
</url>
<url>
<loc>https://example.com/blog/</loc>
<lastmod>2021-06-02</lastmod>
</url>
<url>
<loc>https://example.com/services/</loc>
<lastmod>2021-05-19</lastmod>
</url>
<url>
<loc>https://example.com/about/</loc>
<lastmod>2021-05-19</lastmod>
</url>
<url>
<loc>https://example.com/privacy/</loc>
<lastmod>2021-05-19</lastmod>
</url>
</urlset>
You can see the auto-generated sitemap.xml for this site here.
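One way to check the generated sitemap without reading the XML by eye is to parse it and list the entries. The following is a minimal sketch using only the Python standard library; the function name list_sitemap_entries and the sample document are illustrative assumptions, and in practice the XML text would be the body fetched from your own /sitemap.xml.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample response body (assumed); normally fetched from /sitemap.xml
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2021-05-17</lastmod></url>
  <url><loc>https://example.com/blog/</loc><lastmod>2021-06-02</lastmod></url>
</urlset>"""


def list_sitemap_entries(xml_text):
    """Return (loc, lastmod) tuples for every <url> in a sitemap document."""
    # Encode to bytes so the XML declaration's encoding is honoured
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


for loc, lastmod in list_sitemap_entries(SAMPLE_SITEMAP):
    print(loc, lastmod)
```

A check like this fits easily into a test suite, for example to assert that a page you expect to be listed actually appears.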
Adding Support for Routable Pages
If you're using routable pages on your site, you might want to add their sub-routes as well.
In each routable page class, override the default get_sitemap_urls method, which the sitemap view calls for each page. Add the following method to the class:
class SomeRoutablePage(RoutablePageMixin, Page):
    ....

    def get_sitemap_urls(self, request=None):
        # Start with the default entry for the page itself
        sitemap = super().get_sitemap_urls(request)
        # Add an entry for the named sub-route
        sitemap.append(
            {
                "location": self.full_url + self.reverse_subpage('routable_page_name'),
                "lastmod": self.last_published_at or self.latest_revision_created_at,
            }
        )
        return sitemap
Hiding Pages from the Sitemap
If you have a page class that you don't want to show in the sitemap (pages that you don't want indexed, or an empty redirect page, for example), override get_sitemap_urls and return an empty list:
def get_sitemap_urls(self, request=None):
    return []
Changing the lastmod, changefreq and priority fields
The default lastmod value may not be appropriate if the page content is updated from an external source or built from other pages. For a listing page, for example, it may be better to use the date of the latest item in the list.
On the blog listing page, we might want to set the lastmod field to the date of the most recent blog post update.
You could also add items to the dictionary to give the page priority and changefreq values:
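class BlogListingPage(SEOPage):
    ...

    def get_sitemap_urls(self, request=None):
        sitemap = super().get_sitemap_urls(request)
        # Most recently published live, public child page (None if there are no posts yet)
        latest_post = (
            self.get_children()
            .defer_streamfields()
            .live()
            .public()
            .order_by('last_published_at')
            .last()
        )
        # Update the default entry for this page rather than appending a second entry for the same URL
        for url_item in sitemap:
            url_item.update(
                {
                    "lastmod": latest_post.last_published_at if latest_post else url_item.get("lastmod"),
                    "changefreq": "weekly",
                    "priority": 0.3,
                }
            )
        return sitemap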
<url>
<loc>http://example.com/blog/</loc>
<lastmod>2021-12-20</lastmod>
<changefreq>weekly</changefreq>
<priority>0.3</priority>
</url>
It's worth reading through the notes on priority and changefreq on sitemaps.org before using these fields. It's also worth noting that Google does not take these fields into consideration when indexing a site.
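Since the dictionaries returned by get_sitemap_urls are written by hand, a small sanity check can catch values that fall outside the sitemaps.org protocol. This is only a sketch: check_sitemap_entry is a hypothetical helper, not part of Django or Wagtail, and it assumes entries use the location, changefreq and priority keys shown in this article.

```python
# changefreq values allowed by the sitemaps.org protocol
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def check_sitemap_entry(entry):
    """Return a list of problems found in one sitemap entry dict."""
    problems = []
    if not entry.get("location"):
        problems.append("missing location")
    changefreq = entry.get("changefreq")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append(f"invalid changefreq: {changefreq!r}")
    priority = entry.get("priority")
    # The protocol defines priority as a value between 0.0 and 1.0
    if priority is not None and not 0.0 <= float(priority) <= 1.0:
        problems.append(f"priority out of range: {priority}")
    return problems


good = {"location": "https://example.com/blog/", "changefreq": "weekly", "priority": 0.3}
bad = {"location": "https://example.com/blog/", "changefreq": "fortnightly", "priority": 1.5}
print(check_sitemap_entry(good))
print(check_sitemap_entry(bad))
```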
Enabling sitemap properties on a per page basis
The above methods for configuring sitemap properties work on a site-wide or per class basis. What if you want to edit properties for particular pages without affecting the entire class?
Here, I'm going to return to the SEOPageMixin model I introduced in an earlier blog post, which all my pages inherit from. Where you add these fields depends entirely on how your site is configured, so adjust as suits you.
Add Page class properties
I'll add three fields to my mixin:
- a BooleanField, search_engine_index, to instruct my sitemap view whether to include the page
- an optional CharField with choices to indicate the change frequency
- an optional DecimalField to indicate page priority (max_digits=2, decimal_places=1 allows one digit either side of the decimal point; these options alone do not enforce the valid 0.0 to 1.0 range, so add validators if you need that enforced)
The panels are added to the end of the Promote tab.
Remember, excluding a page from the sitemap does not prevent a search engine from indexing it, since the site will also be crawled from internal and external links. You should, at the very least, conditionally add <meta name="robots" content="noindex"> to a page's <head> where search_engine_index=False, or use some other means of instructing the search engine not to index that page.
Add methods
There are two methods to add at this point:
- The lastmod property is here to make overriding this at class level easier: you only need to redefine this property rather than the whole get_sitemap_urls method.
- get_sitemap_urls overrides the default method from Wagtail's sitemap as discussed previously and takes the new custom fields into account. changefreq and priority are only added if they are set on the page instance. If the page instance has search_engine_index=False, an empty list is returned to skip adding it to the sitemap.
class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ....
    search_engine_index = models.BooleanField(
        blank=False,
        null=False,
        default=True,
        verbose_name=_("Allow search engines to index this page?")
    )
    search_engine_changefreq = models.CharField(
        max_length=25,
        choices=[
            ("always", _("Always")),
            ("hourly", _("Hourly")),
            ("daily", _("Daily")),
            ("weekly", _("Weekly")),
            ("monthly", _("Monthly")),
            ("yearly", _("Yearly")),
            ("never", _("Never")),
        ],
        blank=True,
        null=True,
        verbose_name=_("Search Engine Change Frequency (Optional)"),
        help_text=_("How frequently the page is likely to change? (Leave blank for default)")
    )
    search_engine_priority = models.DecimalField(
        max_digits=2,
        decimal_places=1,
        blank=True,
        null=True,
        verbose_name=_("Search Engine Priority (Optional)"),
        help_text=_("The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. (Leave blank for default)")
    )
    ....
    promote_panels = [
        ...
        MultiFieldPanel([
            FieldPanel('search_engine_index'),
            FieldPanel('search_engine_changefreq'),
            FieldPanel('search_engine_priority'),
        ], _("Search Engine Indexing")),
    ]
    ....

    @property
    def lastmod(self):
        return self.last_published_at or self.latest_revision_created_at

    def get_sitemap_urls(self, request=None):
        if not self.search_engine_index:
            # An empty list leaves this page out of the sitemap
            return []
        # Update Wagtail's default entry in place rather than appending
        # a second entry for the same URL
        sitemap = super().get_sitemap_urls(request)
        for url_item in sitemap:
            url_item["lastmod"] = self.lastmod
            if self.search_engine_changefreq:
                url_item["changefreq"] = self.search_engine_changefreq
            # Compare with None so that a priority of 0.0 is still written
            if self.search_engine_priority is not None:
                url_item["priority"] = self.search_engine_priority
        return sitemap
Migrate the changes and you'll see the additional fields on your Promote tab in the Page edit view.
Choose a page and deselect the option Allow search engines to index this page? and check your sitemap to verify the page is no longer listed.
Set the option to true again and choose values for Search Engine Change Frequency and Search Engine Priority and verify these are shown in the sitemap.
<url>
<loc>http://localhost/en/tech-blog/</loc>
<lastmod>2022-12-21</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
Warning
If you are translating your site with wagtail-localize, any changes made to these custom fields on a page must be synchronised to your translated pages to take effect in those locales.
Configuring Multi-lingual Sites with Custom Sitemap View
Using wagtail-localize, I found that only the default language pages get added to the sitemap. I've added it as a bug/feature request, so I hope to see it in a future release.
To get around this, I dropped the Wagtail sitemap generator and created my own view.
xhtml Links
Google documentation recommends adding a <xhtml:link> entry for each translation of the page entry, including the page itself, with an additional x-default entry for unmatched languages.
<link rel="alternate" href="https://example.com/en-gb" hreflang="en-gb" />
<link rel="alternate" href="https://example.com/en-us" hreflang="en-us" />
<link rel="alternate" href="https://example.com/en-au" hreflang="en-au" />
<link rel="alternate" href="https://example.com/country-selector" hreflang="x-default" />
Google requires the xhtml namespace to be specified as:
xmlns:xhtml="http://www.w3.org/1999/xhtml"
If you use this with xhtml links in your sitemap, you'll see that the XML isn't displayed as a tree in your browser. Your alternates entries won't be visible and your <loc> entries will be displayed as one long string. While it looks like an error, this is the namespace required by Google.
You can use the following to view your sitemap in your browser; however, Google will report an error that the xmlns:xhtml namespace hasn't been declared if you submit this. Use this for testing only:
xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html"
Alternatively, just view the page source for the sitemap.xml in your browser.
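Another way to inspect the alternates without changing the namespace is to parse the XML in a script. The sketch below uses Python's standard library; alternates_by_location is a hypothetical helper name and the sample document is an assumed, cut-down sitemap using the namespace Google requires.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Assumed sample; in practice use the body returned by /sitemap.xml
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/"/>
    <xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/"/>
  </url>
</urlset>"""


def alternates_by_location(xml_text):
    """Map each <loc> to a dict of hreflang -> href from its xhtml:link children."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
        }
    return result


print(alternates_by_location(SAMPLE))
```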
Customise the Django sitemap template
The default Django template works well for our purpose and saves a lot of recoding.
Since we are not using the Django or Wagtail sitemaps app here, copy the template from django/contrib/sitemaps/templates/sitemap.xml, or copy the code below, to the root of your site template folder (or to your preferred location). You should have the following:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
{% spaceless %}
{% for url in urlset %}
<url>
<loc>{{ url.location }}</loc>
{% if url.lastmod %}<lastmod>{{ url.lastmod|date:"Y-m-d" }}</lastmod>{% endif %}
{% if url.changefreq %}<changefreq>{{ url.changefreq }}</changefreq>{% endif %}
{% if url.priority %}<priority>{{ url.priority }}</priority>{% endif %}
{% for alternate in url.alternates %}
<xhtml:link rel="alternate" hreflang="{{ alternate.lang_code }}" href="{{ alternate.location }}"/>
{% endfor %}
</url>
{% endfor %}
{% endspaceless %}
</urlset>
Add an alternates Page property
Examining the template above, the code looks for an alternates key in each urlset entry. The value is expected to be a list of dictionaries, similar to the urlset variable. Each alternate should have lang_code and location key/value pairs to define the language and url for each translation of the current page (including a self-reference to the current page).
As well as each translation, we need to add an entry with lang_code='x-default' and location pointing to the default url for that page.
With wagtail-localize, we can use the url of the default language page. If the page doesn't exist in the default language, we can use the first element of the translated pages queryset which will be the source page for those translations.
To build the alternates value for each page, I'll head back to my SEOPageMixin (see above) and add two further properties:
from django.utils.functional import cached_property
from wagtail.models import Locale
....
class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...

    @cached_property
    def translations(self):
        """
        Return dict of lang-code/url key/value pairs for each page that has a live translation including self
        Urls are relative.
        """
        return {
            page.locale.language_code: page.url
            for page in self.get_translations(inclusive=True)
            .live()
            .defer_streamfields()
        }

    @cached_property
    def alternates(self):
        """
        Create list of translations for <link rel="alternate" ...> head entries.
        Convert translations dict into list of dictionaries with lang_code and location keys for each translations item.
        Convert translations urls to absolute urls instead of relative urls
        Add x-default value.
        """
        default_lang_code = Locale.get_default().language_code
        site_root = self.get_site().root_url
        alt = [
            {"lang_code": key, "location": f"{site_root}{value}"}
            for key, value in self.translations.items()
        ]
        x_default = self.translations.get(default_lang_code)
        if not x_default:
            # doesn't exist in default locale, use the first locale in the translations
            x_default = list(self.translations.values())[0]
        alt.append({"lang_code": "x-default", "location": f"{site_root}{x_default}"})
        return alt
I've split this into two properties because each one serves a further purpose beyond the sitemap:
- Translations: This will be used to build a language switcher on page load, best with relative urls. The language switcher will be covered in a later article. The property returns a dictionary similar to the following:
{'en': '/en/products/',
'de': '/de/produkte/',
'es': '/es/productos/',
'fr': '/fr/produits/'}
- Alternates: Aside from the sitemap, this can also be used to build the <link rel="alternate" ...> elements for the page <head>. This property can be built out of the translations dictionary without calling get_translations() a second time. These should be absolute urls, so the root url for the site that the page belongs to is injected when building the list of values. Those links can now be added to the page template with:
{% for alt in self.alternates %}
<link rel="alternate" hreflang="{{ alt.lang_code }}" href="{{ alt.location }}">
{% endfor %}
<link rel="alternate" hreflang="en" href="http://localhost:8000/en/products/">
<link rel="alternate" hreflang="de" href="http://localhost:8000/de/produkte/">
<link rel="alternate" hreflang="es" href="http://localhost:8000/es/productos/">
<link rel="alternate" hreflang="fr" href="http://localhost:8000/fr/produits/">
<link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/products/">
Using @cached_property means these values are calculated only once per page instance, however many times they are accessed while handling a request.
When a page has been translated, but not into the default language, the x-default page is set to the first in the translations items. It's not strictly correct, but the url will work regardless, so long as it is consistent across all the translated pages for this page.
Now, the alternates property for my default homepage returns the following value:
[
{'lang_code': 'en', 'location': 'http://localhost:8000/en/'},
{'lang_code': 'es', 'location': 'http://localhost:8000/es/'},
{'lang_code': 'x-default', 'location': 'http://localhost:8000/en/'}
]
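To make the x-default fallback easier to follow (and to test without a database), the same logic can be sketched as a plain function. build_alternates is a hypothetical stand-in for the alternates property above: it takes the translations dictionary, the site root url and the default language code as arguments instead of reading them from the page.

```python
def build_alternates(translations, site_root, default_lang_code):
    """Build the alternates list from a {lang_code: relative_url} dict."""
    alternates = [
        {"lang_code": lang, "location": f"{site_root}{url}"}
        for lang, url in translations.items()
    ]
    if not translations:
        # Nothing live to point x-default at
        return alternates
    x_default = translations.get(default_lang_code)
    if x_default is None:
        # Not translated into the default language: use the first translation
        x_default = next(iter(translations.values()))
    alternates.append(
        {"lang_code": "x-default", "location": f"{site_root}{x_default}"}
    )
    return alternates


home = build_alternates({"en": "/en/", "es": "/es/"}, "http://localhost:8000", "en")
no_default = build_alternates({"de": "/de/produkte/", "fr": "/fr/produits/"}, "http://localhost:8000", "en")
print(home)
print(no_default[-1])
```

The first call reproduces the homepage value shown above; the second shows x-default falling back to the German page when no English page exists.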
Amend get_sitemap_urls()
To the get_sitemap_urls method we defined in the previous section:
- We need to add the alternates value to the urlset item for that page.
- Since we're using a custom view, we don't need the call to super().get_sitemap_urls().
- We'll just return the url_item dictionary rather than the entire sitemap. We'll append the url_item in the view instead.
class SEOPageMixin(index.Indexed, WagtailImageMetadataMixin, models.Model):
    ...

    def get_sitemap_urls(self, request=None):
        if self.search_engine_index:
            url_item = {
                "location": self.full_url,
                "lastmod": self.lastmod,
                "alternates": self.alternates
            }
            if self.search_engine_changefreq:
                url_item["changefreq"] = self.search_engine_changefreq
            # Compare with None so that a priority of 0.0 is still written
            if self.search_engine_priority is not None:
                url_item["priority"] = self.search_engine_priority
            return url_item
        else:
            return []
Create a Custom Sitemap view
To tie this all together, we need a custom view that finds the homepage for the requested site, gets all translations of that page and iterates through all the descendants of each.
From this, we render a TemplateResponse using the customised template and the urlset. To this, we add an X-Robots-Tag header entry to tell search engines not to index or archive this page. We also calculate the most recent update from the listed pages and use this as the last-modified header value.
Pages with search_engine_index=False return an empty list from get_sitemap_urls, so the view skips any empty result rather than appending it. (Calling urlset.remove([]) afterwards would only strip the first empty entry, and it throws an error if none is found.)
I have a core app on my sites where I keep such code; adjust to your own site:
from datetime import datetime

from django.template.response import TemplateResponse
from django.utils.http import http_date
from wagtail.models import Site


def sitemap(request):
    site = Site.find_for_request(request)
    urlset = []
    # Find root page for the site and translations of root page
    for locale_home in (
        site.root_page.get_translations(inclusive=True)
        .live()
        .defer_streamfields()
        .specific()
    ):
        # For every page in each locale tree, add url set entry
        for page in (
            locale_home.get_descendants(inclusive=True)
            .live()
            .defer_streamfields()
            .specific()
        ):
            url_item = page.get_sitemap_urls()
            # skip the empty list returned by pages excluded from indexing
            if url_item:
                urlset.append(url_item)
    # get the last_modified value for the sitemap header;
    # if urlset is empty or no lastmod values are set, use now instead
    lastmod_values = [x['lastmod'] for x in urlset if x.get('lastmod')]
    last_modified = max(lastmod_values) if lastmod_values else datetime.now()
    return TemplateResponse(
        request,
        template='sitemap.xml',
        context={'urlset': urlset},
        content_type='application/xml',
        headers={
            "X-Robots-Tag": "noindex, noodp, noarchive",
            "last-modified": http_date(last_modified.timestamp()),
            "vary": "Accept-Encoding",
        }
    )
defer_streamfields()
Apply to a queryset to prevent fetching/decoding of StreamField values on evaluation. Useful when working with potentially large numbers of results, where StreamField values are unlikely to be needed. For example, when generating a sitemap or a long list of page links.
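The last-modified calculation in the view can be tried in isolation. The sketch below uses only the standard library and substitutes email.utils.format_datetime for Django's http_date, which is a swap made for illustration; sitemap_last_modified and as_http_date are hypothetical helper names, and the lastmod values are assumed to be timezone-aware datetimes, as Wagtail stores them when USE_TZ is enabled.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def sitemap_last_modified(urlset, fallback=None):
    """Return the most recent lastmod in the urlset, ignoring missing values."""
    dates = [entry["lastmod"] for entry in urlset if entry.get("lastmod")]
    if dates:
        return max(dates)
    # Empty urlset or no lastmod values: fall back to the given time, or now
    return fallback or datetime.now(timezone.utc)


def as_http_date(moment):
    """Format an aware datetime as an HTTP date string in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


older = datetime(2022, 9, 4, tzinfo=timezone.utc)
newer = datetime(2022, 12, 21, 10, 30, tzinfo=timezone.utc)
urlset = [{"lastmod": older}, {"lastmod": newer}, {"lastmod": None}]
print(as_http_date(sitemap_last_modified(urlset)))
```

Filtering out missing values first avoids relying on an exception handler to cover pages that have never been published.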
Amend the sitemap urls.py entry
In your site urls.py, remove the import for the Wagtail sitemap view (if you were using it), import the one you just created and add:
from core.views import sitemap
from django.urls import re_path
...
urlpatterns = [
    ...
    re_path(r'^sitemap\.xml$', sitemap, name='sitemap'),
    ...
]
You can also remove Wagtail and Django sitemaps from your INSTALLED_APPS list now.
Example Output
I'll use a test site that has English and Spanish enabled with home page, blog index and three blog pages. All pages except the third blog page have been translated.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>http://localhost:8000/en/</loc>
<lastmod>2022-09-04</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/" />
</url>
<url>
<loc>http://localhost:8000/en/tech-blog/</loc>
<lastmod>2022-12-21</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/" />
</url>
<url>
<loc>http://localhost:8000/en/tech-blog/first-tech-blog/</loc>
<lastmod>2022-09-26</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
</url>
<url>
<loc>http://localhost:8000/en/tech-blog/second-tech-blog/</loc>
<lastmod>2022-12-21</lastmod>
<changefreq>weekly</changefreq>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
</url>
<url>
<loc>http://localhost:8000/en/tech-blog/third-tech-blog/</loc>
<lastmod>2022-12-16</lastmod>
</url>
<url>
<loc>http://localhost:8000/es/</loc>
<lastmod>2022-08-16</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/" />
</url>
<url>
<loc>http://localhost:8000/es/tech-blog/</loc>
<lastmod>2022-12-21</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/" />
</url>
<url>
<loc>http://localhost:8000/es/tech-blog/primer-tech-blog/</loc>
<lastmod>2022-09-06</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/primer-tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/first-tech-blog/" />
</url>
<url>
<loc>http://localhost:8000/es/tech-blog/segundo-tech-blog/</loc>
<lastmod>2022-08-18</lastmod>
<xhtml:link rel="alternate" hreflang="en" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
<xhtml:link rel="alternate" hreflang="es" href="http://localhost:8000/es/tech-blog/segundo-tech-blog/" />
<xhtml:link rel="alternate" hreflang="x-default" href="http://localhost:8000/en/tech-blog/second-tech-blog/" />
</url>
</urlset>
Notify Google of Sitemap Updates
Google has announced that it will be deprecating the ping API at the end of 2023. Other search engines may continue their ping APIs; adjust for any that you might want to use.
To submit your sitemap to Google programmatically, you can send a get request to the ping tool in the format:
- https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP
You can do this either on page publish or delete, set this as a cron job to run periodically, or just add a button to your admin site to run this manually.
I've just used Google in the example below; for a more thorough approach, add other search engines as a list and loop through it.
First, I create the function to ping Google with the sitemap URL, saved to utils.py in my core app:
from urllib.parse import urlencode
from urllib.request import urlopen

from django.urls import reverse

PING_URL = "https://www.google.com/webmasters/tools/ping"


def ping_google(request, ping_url=PING_URL):
    try:
        # Build the absolute sitemap URL from the named url pattern
        sitemap = request.build_absolute_uri(reverse('sitemap'))
        params = urlencode({"sitemap": sitemap})
        urlopen(f"{ping_url}?{params}")
    except Exception as e:
        print(f"{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}")
reverse('sitemap') assumes I have named my sitemap view in urls.py with 'sitemap':
re_path(r'^sitemap\.xml$', sitemap, name='sitemap')
Adjust for your site if necessary.
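To notify more than one search engine, the ping URL for each can be built from a list of endpoints. This is a rough sketch only: the endpoints below are placeholders on reserved example domains, not real services, and build_ping_urls is a hypothetical helper. Check each search engine's documentation for whether it still offers a ping endpoint and what format it expects.

```python
from urllib.parse import urlencode

# Placeholder endpoints for illustration only; replace with real ones
PING_ENDPOINTS = [
    "https://search-one.example/ping",
    "https://search-two.example/ping",
]


def build_ping_urls(sitemap_url, endpoints=PING_ENDPOINTS):
    """Return one ping URL per endpoint with the sitemap URL safely encoded."""
    params = urlencode({"sitemap": sitemap_url})
    return [f"{endpoint}?{params}" for endpoint in endpoints]


for url in build_ping_urls("https://example.com/sitemap.xml"):
    print(url)
```

Each resulting URL could then be passed to urlopen in the same way as the single Google URL, inside its own try/except so that one failing endpoint does not stop the others.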
For a Wagtail site publishing occasionally, I'll ping Google each time a page is updated, published or deleted using hooks. Be sure to add the test for debug to limit this to your production site:
from django.conf import settings
from wagtail import hooks

from .utils import ping_google


@hooks.register('after_delete_page')
def do_after_delete_page(request, page):
    # Only ping from the production site
    if not settings.DEBUG:
        ping_google(request)


@hooks.register("after_publish_page")
def do_after_publish(request, page):
    if not settings.DEBUG:
        ping_google(request)
Conclusion
In this guide, we explored the intricacies of dynamic sitemap creation in Wagtail, leveraging both the built-in sitemap app and custom views. Key highlights include:
- Incorporating entries for routable pages to enhance sitemap comprehensiveness.
- Effectively hiding specific pages from the sitemap for tailored SEO strategies.
- Fine-tuning the lastmod field and introducing values for changefreq and priority fields at both site-wide and class levels.
- Empowering custom fields within mixins, enabling per-page adjustments to enhance flexibility.
- Crafting sitemaps for multilingual sites via a bespoke view, ensuring inclusivity.
- Proactively notifying Google of sitemap updates for optimal search engine visibility.
By mastering these techniques, you're equipped to sculpt precise and dynamic sitemaps tailored to your Wagtail projects, optimizing search engine interactions and bolstering your site's overall SEO performance.