When the subject is search engine (inverted index) it is extremely common to think of solutions like Apache Solr or Elastic (former elasticsearch), but why don't we remember the Full Text Search feature we have native in PostgreSQL?
The full text search feature in PostgreSQL became native in version 9.4, but before that we had the GiST Indexes, which was used as the basis for the development of Full Text Search (tsvector).
What Is Full Text Search Anyway?
(...) full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database; (...) distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). Wikipedia
In other words, imagine you have a set of text documents stored in a database. These documents are not just meta-data items like an author name or a country of origin, but rather an abstract for an article, or full text articles themselves, and you want to find out if certain words are present or not in them.
E.g. you want to search all the news that has subistantive related to "dog" or "fox"(we are talking about animals) are present so if they are in their singular form, you'll find them with the ILIKE keyword...
... but you'll also find stuff like "foxtrot" or "Dogville", which is not quite what you intended, you had to declare in which field you would search, if you did not have the terms searched in the field will not bring record.
Another problem is that if you search for a word such as "query", and if it's present in its plural form "queries", then you won't find it if you try a simple pattern search with LIKE, even though the word is, in fact, there. Some of you might be thinking to use regular expressions, and yes, you could do that, regular expressions are incredibly powerful, but also terribly slow.
A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).
That is, in a nutshell, the principle of full text search, thinking about problems related to this was developed data type tsvector.
What do you need to know to use tsvector?
PostgreSQL has two functions that do exactly what we intend to do:
to_tsvector for creating a list of tokens (the tsvector data type, where ts stands for "text search");
to_tsquery for querying the vector for occurrences of certain words or phrases.
For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog":
SELECTto_tsvector('The quick brown fox jumped over the lazy dog.');
Which will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted:
By default, every word is normalized as a lexeme in _English+ (e.g. "jumped" becomes "jump"), case depending on the localization settings of your PostgreSQL installation.
A common doubt is about support for other languages (e.g. Brazilian Portuguese). The good news is that we have support for several languages, see the list:
The way it is when creating a record will not be created the tokens to do textual search, how will we solve this?
We can create a trigger that listens to all the creation and updating, joins the text of the title and description and generates the tokens automatically:
It is now possible to do text search using the full text search feature (function to_tsquery):
SELECT*FROMnewsWHEREtokens@@to_tsquery('fox');
Return:
id | title | description | created_at | last_update | client_id | tokens
----+-------------+----------------------------------------------+----------------------------+----------------------------+-----------+------------------------------------------------------------------------
1 | dog history | The quick brown fox jumped over the lazy dog | 2020-12-11 15:58:55.298558 | 2021-01-02 16:05:21.088482 | 1 | 'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3
(1 row)
PostgreSQL ➕ REST, low-code, simplify and accelerate development, ⚡ instant, realtime, high-performance on any Postgres application, existing or new
pRESTd
pREST (PostgreSQLREST), is a simple production-ready API, that delivers an instant, realtime, and high-performance application on top of your existing or new Postgres database.
PostgreSQL version 9.5 or higher
Contributor License Agreement -
Problems we solve
The pREST project is the API that addresses the need for fast and efficient solution in building RESTful APIs on PostgreSQL databases. It simplifies API development by offering:
A lightweight server with easy configuration;
Direct SQL queries with templating in customizable URLs;
Optimizations for high performance;
Enhanced developer productivity;
Authentication and authorization features;
Pluggable custom routes and middlewares.
Overall, pREST simplifies the process of creating secure and performant RESTful APIs on top of your new or old PostgreSQL database.
This feature was implemented in version 1.0.5 as a search filter and works like this:
GET /{DATAVASE}/{SCHEMA}/news?tokens:tsquery=fox
[{"id":1,"title":"dog history","description":"The quick brown fox jumped over the lazy dog","created_at":"2020-12-11T15:58:55.298558","last_update":"2021-01-02T16:05:21.088482","client_id":1,"tokens":"'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3"}]