
AI-supported search within websites – using only that site’s content to answer user queries – is here. Soapbox is developing this functionality and adding it to sites, and it is radically altering search-result relevance and user engagement.
Traditional searches built into content management systems (eg, WordPress or Drupal) find content related to a visitor’s keyword search. They match on those keywords, count their frequency, and apply basic weighting, for example scoring results higher when the words appear in headings.
This works fine if you are a site editor: you know the organisational terminology, so you can find decent results. This all starts to fall apart when end users of your website try to find the content they need and do not have that ‘internal’ knowledge.
Search engines like Apache Solr help with basic synonyms, and can even account for typos and loose phrasing using techniques called fuzziness and sloppiness, in which inexact matches (a misspelt word, or phrase words that are not quite adjacent) can still return results. Even with these tools, though, users are still likely to miss out on the most relevant content.
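To make the typo handling concrete, here is a minimal sketch of fuzzy matching based on edit distance (the number of single-character changes needed to turn one word into another). It is an illustration only, not how any particular search engine is implemented; the function names and the `max_edits` default are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query_term: str, indexed_terms: list[str], max_edits: int = 2) -> list[str]:
    """Return indexed terms within max_edits of the query term (hypothetical helper)."""
    query = query_term.lower()
    return [t for t in indexed_terms if edit_distance(query, t.lower()) <= max_edits]


# A misspelt query still finds the intended term.
matches = fuzzy_match("helthcare", ["healthcare", "housing", "health"])
```

Note that this only rescues spelling mistakes. A visitor who types a different word with the same meaning is still out of luck, which is the gap the rest of this article addresses.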
For instance, what if you have a very relevant article that is short, so its keyword frequency is very low, alongside a longer article (eg, an annual report) with plenty of keyword matches but much less relevance to a specific visitor’s query? The longer piece is much more likely to appear in your website’s standard search results.
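The keyword scoring described above, and the way it favours long documents, can be sketched in a few lines. This is a deliberately simplified illustration, not the scoring used by WordPress, Drupal or Solr; the weights, function names and sample content are all invented for the example.

```python
import re

# Assumed weights for illustration: a heading match counts three times a body match.
HEADING_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_score(query: str, heading: str, body: str) -> float:
    """Count how often the query's words occur, boosting heading matches."""
    terms = set(tokenize(query))
    heading_hits = sum(1 for token in tokenize(heading) if token in terms)
    body_hits = sum(1 for token in tokenize(body) if token in terms)
    return HEADING_WEIGHT * heading_hits + BODY_WEIGHT * body_hits


# A short, focused article versus a long report that repeats the keywords.
short_score = keyword_score(
    "waiting times",
    heading="Waiting times explained",
    body="Why patients wait longer for treatment.",
)
long_score = keyword_score(
    "waiting times",
    heading="Annual report",
    body="Waiting times rose. " * 10,
)
```

Even with the heading boost, the repetitive report outscores the focused explainer here, simply because it contains the words more often.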
What about visitors to your site – perhaps younger audiences – who are becoming more accustomed to searching by asking questions in full sentences? It’s easy to see how things start to unravel, and how visitors might end up back at Google to search your site. This means that:
- Your carefully curated filtering and categorisation efforts are wasted
- Your visitor may end up at a competitor’s site instead
Enter: vector databases
This is where each content item gets ‘vectorised’: converted into a list of numbers with hundreds or thousands of dimensions, which together capture how weakly or strongly the content relates to a multitude of topics. This draws on the extensive training of a large language model (LLM), such as those provided by OpenAI. The words a user enters are then vectorised in the same way, and the vector database does some complicated mathematics (eg, cosine similarity or Euclidean distance) to retrieve the content closest to the query as the top results.
Since this approach points content in the direction of topics, words with similar meanings sit close to each other, meaning that the user does not need to use the same terminology as the content of the website. Keyword-based approaches require the user to know the terms you generally use.
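The ‘complicated mathematics’ is less daunting than it sounds. The sketch below ranks content by cosine similarity, which measures how closely two vectors point in the same direction. The three-dimensional vectors are hand-made toys (real ones come from an embedding model and have far more dimensions), and the titles, numbers and function names are all invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return 1.0 for vectors pointing the same way, 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_results(query_vector: list[float],
                content_vectors: dict[str, list[float]],
                limit: int = 3) -> list[str]:
    """Return the titles whose vectors are most similar to the query vector."""
    ranked = sorted(
        content_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:limit]]


# Toy vectors: the dimensions loosely stand for (health, finance, housing).
CONTENT_VECTORS = {
    "Hospital waiting lists briefing": [0.9, 0.1, 0.0],
    "Annual financial report": [0.2, 0.9, 0.1],
    "Affordable homes explainer": [0.0, 0.2, 0.9],
}

# Pretend vector for "how long will I wait to see a doctor?"
query_vector = [0.8, 0.2, 0.1]
best = top_results(query_vector, CONTENT_VECTORS, limit=2)
```

The query shares no keywords with the briefing's title, yet the briefing ranks first because the two vectors point in nearly the same direction.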
Soapbox developers have contributed heavily to AI initiatives, particularly in the Drupal community, and co-maintain the AI Search functionality. We have refined strategies that vectorise the relevant parts of think tank content, factor in recency, and maintain your existing filtering to vastly improve search relevance.
AI overviews
As well as the work on relevance, we have also been developing an AI Overview functionality – of the type now seen in Google search results, but answered only from your content.
Take a look at how we did this for the Nuffield Trust.
Here we use something like the top three to six highly relevant results from the vector database to ask a question of an LLM. It produces a very accurate answer with a far lower risk of ‘hallucination’ since the LLM is specifically given the content to answer from, is told not to answer if it does not find the answer, and is only given relevant results. The answer is clearly marked as ‘AI generated’, and the visitor is prompted to read the original content items used to generate it.
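As a rough sketch of that step, the code below assembles a prompt from the top search results, tells the model to answer only from them, and gives it an explicit way to decline. This is an assumed illustration rather than the prompt used on any live site: the wording, the `SearchResult` type, the function name and the example URLs are all hypothetical, and the call to the LLM itself is left out.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One content item returned by the vector database (hypothetical shape)."""
    title: str
    url: str
    text: str


NO_ANSWER = "I could not find an answer in this site's content."


def build_overview_prompt(question: str,
                          results: list[SearchResult],
                          max_results: int = 6) -> str:
    """Build a prompt that restricts the LLM to the supplied sources."""
    selected = results[:max_results]  # only the most relevant items are passed on
    sources = "\n\n".join(
        f"[{number}] {result.title} ({result.url})\n{result.text}"
        for number, result in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the numbered sources below.\n"
        f"If the sources do not contain the answer, reply exactly: {NO_ANSWER}\n"
        "Cite the numbers of the sources you used.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Example with placeholder content.
example_results = [
    SearchResult(f"Article {n}", f"https://example.org/article-{n}", f"Body text {n}.")
    for n in range(1, 9)
]
prompt = build_overview_prompt("How long are waiting times?", example_results)
```

Numbering the sources makes it straightforward to link the generated answer back to the original content items, so the visitor can read them in full.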
This is a new approach to search, and one we highly recommend.
- AI summary answers are very useful in helping visitors to see they have come to the right site to find the information they are seeking.
- Effective summaries discourage users from going to look elsewhere – as they might do if the title and summary in traditional search results do not give them confidence that their question will be answered by your content.
And more…
Vector databases have other uses, such as finding far more relevant related content to encourage users to read more, or improving the site editing experience in the content management system by suggesting relationships between content.
Get in touch with us to find out more.