When the subject is search engine (inverted index) it is extremely common to think of solutions like Apache Solr or Elastic (former elasticsearch), but why don't we remember the Full Text Search feature we have native in PostgreSQL?

The full text search feature in PostgreSQL became native in version 9.4, but before that we had the GiST Indexes, which was used as the basis for the development of Full Text Search (tsvector).

What Is Full Text Search Anyway?

(...) full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database; (...) distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). Wikipedia

In other words, imagine you have a set of text documents stored in a database. These documents are not just meta-data items like an author name or a country of origin, but rather an abstract for an article, or full text articles themselves, and you want to find out if certain words are present or not in them.

E.g. you want to search all the news that has subistantive related to "dog" or "fox" (we are talking about animals) are present so if they are in their singular form, you'll find them with the ILIKE keyword...

SELECT * FROM news  
WHERE  
(body ILIKE '%fox%' OR
body ILIKE '%dog%');

... but you'll also find stuff like "foxtrot" or "Dogville", which is not quite what you intended, you had to declare in which field you would search, if you did not have the terms searched in the field will not bring record.

Another problem is that if you search for a word such as "query", and if it's present in its plural form "queries", then you won't find it if you try a simple pattern search with LIKE, even though the word is, in fact, there. Some of you might be thinking to use regular expressions, and yes, you could do that, regular expressions are incredibly powerful, but also terribly slow.

A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).

That is, in a nutshell, the principle of full text search, thinking about problems related to this was developed data type tsvector.

What do you need to know to use tsvector?

PostgreSQL has two functions that do exactly what we intend to do:

to_tsvector for creating a list of tokens (the tsvector data type, where ts stands for "text search");
to_tsquery for querying the vector for occurrences of certain words or phrases.

For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog":

SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');

Which will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted:

                      to_tsvector
-------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

By default, every word is normalized as a lexeme in _English+ (e.g. "jumped" becomes "jump"), case depending on the localization settings of your PostgreSQL installation.

A common doubt is about support for other languages (e.g. Brazilian Portuguese). The good news is that we have support for several languages, see the list:

SELECT cfgname FROM pg_ts_config;

cfgname
--------
simple
arabic
danish
dutch
english
finnish
french
german
hungarian
indonesian
irish
italian
lithuanian
nepali
norwegian
portuguese
romanian
russian
spanish
swedish
tamil
turkish

Deeper into the PostgreSQL text search configuration.

Real example generating tokens

We have news table with title field, description and other meta data:

CREATE TABLE news (
    id SERIAL PRIMARY KEY,
    title text NOT NULL,
    description text NOT NULL,
    created_at timestamp without time zone NOT NULL DEFAULT now(),
    last_update timestamp without time zone NOT NULL DEFAULT now(),
    tokens tsvector
);

The way it is when creating a record will not be created the tokens to do textual search, how will we solve this?

We can create a trigger that listens to all the creation and updating, joins the text of the title and description and generates the tokens automatically:

CREATE OR REPLACE FUNCTION set_full_text_search_on_news()
RETURNS TRIGGER AS $$
BEGIN
       NEW.tokens = to_tsvector(concat(NEW.title, ", ", NEW.description))
       RETURN NEW;
END;
$$ language 'plpgsql';

CREATE TRIGGER update_new_full_text_search
       BEFORE UPDATE ON news
       FOR EACH ROW EXECUTE PROCEDURE set_full_text_search_on_news();

Inserting a Record:

INSERT INTO news (title, description)
VALUES
('dog history', 'The quick brown fox jumped over the lazy dog');

The tokens field will be generated automatically, thus remaining:

'caolha':4 'da':2 'de':5 'em':12 'fax':9 'gigant':11 'história':1 'java':6 'mandar':8 'new':13 'para':10 'querem':7 'york':14 'zebrazebra':3

It is now possible to do text search using the full text search feature (function to_tsquery):

SELECT * FROM news WHERE tokens @@ to_tsquery('fox');

Return:

 id |    title    |                 description                  |         created_at         |        last_update         | client_id |                                 tokens
----+-------------+----------------------------------------------+----------------------------+----------------------------+-----------+------------------------------------------------------------------------
  1 | dog history | The quick brown fox jumped over the lazy dog | 2020-12-11 15:58:55.298558 | 2021-01-02 16:05:21.088482 |         1 | 'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3
(1 row)

How to use this feature in pREST?

prest / prest

PostgreSQL ➕ REST, low-code, simplify and accelerate development, ⚡ instant, realtime, high-performance on any Postgres application, existing or new

pRESTd

pREST (PostgreSQL REST), is a simple production-ready API, that delivers an instant, realtime, and high-performance application on top of your existing or new Postgres database.

PostgreSQL version 9.5 or higher

Contributor License Agreement -

Problems we solve

The pREST project is the API that addresses the need for fast and efficient solution in building RESTful APIs on PostgreSQL databases. It simplifies API development by offering:

A lightweight server with easy configuration;
Direct SQL queries with templating in customizable URLs;
Optimizations for high performance;
Enhanced developer productivity;
Authentication and authorization features;
Pluggable custom routes and middlewares.

Overall, pREST simplifies the process of creating secure and performant RESTful APIs on top of your new or old PostgreSQL database.

Why we built pREST

When we built pREST, we originally intended to contribute and build with the PostgREST project, although it took a lot…

View on GitHub

This feature was implemented in version 1.0.5 as a search filter and works like this:

GET /{DATAVASE}/{SCHEMA}/news?tokens:tsquery=fox

[
    {
        "id": 1,
        "title": "dog history",
        "description": "The quick brown fox jumped over the lazy dog",
        "created_at": "2020-12-11T15:58:55.298558",
        "last_update": "2021-01-02T16:05:21.088482",
        "client_id": 1,
        "tokens": "'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3"
    }
]

Read more about pREST search filters here.

Full Text Search in PostgreSQL