Add Unaccent Support in PostgreSQL Search
Introduction
PostgreSQL has a some pretty useful NLP search capabilities out of the box with stemming and stop words already built-in. However, PostgreSQL search backends will only load the character set native to the language you've set your backend to use.
On those languages that use the extended Latin character set, PostgreSQL uses a feature called unacccent to match a base letter with it's accented variants.
This is a very useful feature that enables rapid search entry and assists those users who don't have access to those special characters on their keyboards.
I show a quick way to load the full extended character set unaccenting into a custom English backend and take you through how to verify that the search is indexing and returning results as desired from the command prompt. Finally, I'll use the custom backend for my Wagtail site search to return pages with unaccented queries.
Unaccent
The unaccent text search dictionary enables accent-insensitive full-text search processing and removes accents (diacritic marks) from individual words when indexing and when parsing search terms. Unaccent is a filtering dictionary, which implies that its output is always sent to the following dictionary. When configuring the custom backend, it is placed ahead of the language dictionary.
For example, in French, submitting a query with the word 'hotel' will match with the word 'hôtel'. Doing the same with the English backend will result in no match as unaccent is not enabled by default.
Create the Unaccent Search Configuration
The aim is to duplicate the English search configuration with associated dictionary, stemming and stop words, but to add the unaccent capability.
First, open a psql
prompt with admin privilege, for example:
$ sudo psql -U db_admin -W -h localhost -d mydb
Then create the new search configuration. I'll call it english_extended
for this example:
CREATE TEXT SEARCH CONFIGURATION english_extended ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION english_extended
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, english_stem;
Add Search Configuration to Default Backend
Depending on what you're indexing, you will need to specify your backend to use this new config.
For a Wagtail site, this would mean amending your base.py
file with the following:
WAGTAILSEARCH_BACKENDS = {
'default': {
'BACKEND': 'wagtail.search.backends.database',
'SEARCH_CONFIG': 'english_extended',
},
}
Testing the Unaccent Search Configuration
For testing, I create a test Wagtail site with an indexed page that includes the title Æsop’s über cañon álphàbêţ at the Bodø Strandå Hôtel.
First, to verify expected behaviour, I search with the default English config with the term 'bodø' and 'bodo' to show the unaccented term will not match. At the psql
prompt:
select title from wagtailcore_page where to_tsvector('english', title) @@ to_tsquery('bodø');
title
----------------------------------------------------
Æsop’s über cañon álphàbêţ at the Bodø Strandå Hôtel
(1 row)
select title from wagtailcore_page where to_tsvector(‘english’, title) @@ to_tsquery('bodo');
title
-------
(0 rows)
As expected, the unaccented term does not match with the default English configuration. Now we try with the unaccented term and the extended configuration we just created:
select title from wagtailcore_page where to_tsvector('english_extended', title) @@ to_tsquery('bodo');
title
----------------------------------------------------
Æsop’s über cañon álphàbêţ at the Bodø Strandå Hôtel
(1 row)
The english_extended
configuration successfully matches the unaccented search term.
What about searching with the original term (bodø)? Does this still match?
select title from wagtailcore_page where to_tsvector('english_extended', title) @@ to_tsquery('bodø');
title
----------------------------------------------------
Æsop’s über cañon álphàbêţ at the Bodø Strandå Hôtel
(1 row)
Unaccenting is being parsed for both indexing and search terms.
wagtailsearch_indexentry
table.
For a Wagtail Search Backend, you can view the stored index in the The indexed entry for our test page is
- 'aesop':1B 'alphabet':5B 'bodo':8B 'canon':4B 'hotel':10B 'stranda':9B 'uber':3B
Test Stemming with Unaccent
One more check to make sure that stop words are being stripped and stem words being stored. I create a test page with the title “We are going walking in the mountains”.
The indexed title appears as expected with stop words omitted and stem words:
'go':3B 'mountain':7B 'today':8B 'walk':4B
The english_extended
search config is enabled on this site. You can search unaccented terms in the menu bar search such as aesop, uber, canon, alphabet, etc. and you will see this page returned in the results (the page body is also indexed).
Testing from a shell_plus
prompt:
In [1]: from wagtail.models import Page
In [2]: from wagtail.search.backends import get_search_backend
In [3]: scope=Page.objects.live()
In [4]: s=get_search_backend()
In [5]: s.config
Out[5]: 'english_extended'
In [6]: s.search('aesop', scope)
Out[6]: <SearchResults [<Page: Æsop’s über cañon álphàbêţ at the Bodø Strandå Hôtel>]>
Unfortunately, although the Wagtail documentation states it's possible to add multiple backends, it is not possible to use multiple Wagtail PostgreSQL backends with differing configs. All that happens is the index gets written over for each backend in the list with the final one winning out.
WAGTAILSEARCH_BACKENDS = {
'default': {
'BACKEND': 'wagtail.search.backends.database',
'SEARCH_CONFIG': 'english_extended',
},
'es': {
'BACKEND': 'wagtail.search.backends.database',
'SEARCH_CONFIG': 'spanish',
},
'fr': {
'BACKEND': 'wagtail.search.backends.database',
'SEARCH_CONFIG': 'french',
},
}
In this case, the site will be indexed with the French search config only.
To use only the Wagtail PostgreSQL backend, the best you can do is extend the 'simple' search config which has stop words and stemming disabled to make it language neutral. You lose some of the NLP features but maintain a searchable site. You can filter returned results based on the activated language of the request to maintain relevance.
Conclusion
As you can see from the code above, adding unaccent support in PostgreSQL is straightforward and easy to apply to your Wagtail search configuration.
This technique will work for other languages that you need to extend the built-in unaccent rules for. Just swap out the relevant language keywords in the example given above.
Extend the 'simple' dictionary on multi-lingual sites to overcome the limitation of one index.