
Full text search configuration on PostgreSQL


I'm facing an issue with the text search configuration in PostgreSQL. I have a table users that contains a column name. A user's name may be French, English, Spanish, or any other language, so I need to use PostgreSQL's full text search. The default text search configuration I'm using now is simple, but it is not effective and does not return suitable results.

I'm trying to combine different text search configurations like this:

(to_tsvector('english', document) || to_tsvector('french', document) || to_tsvector('spanish', document) || to_tsvector('russian', document)) @@
(to_tsquery('english', query) || to_tsquery('french', query) || to_tsquery('spanish', query) || to_tsquery('russian', query))

But this query didn't give suitable results. If we test, for example:

select (to_tsvector('english', 'adam and smith') || to_tsvector('french', 'adam and smith') || to_tsvector('spanish', 'adam and smith') || to_tsvector('russian', 'adam and smith')) 

tsvector: 'adam':1,4,7,10 'and':5,8 'smith':3,6,9,12

Using the original language of the words:

select (to_tsvector('english', 'adam and smith')) 
tsvector: 'adam':1 'smith':3

The first thing to mention is that the stopwords were not taken into consideration when we combined different configurations with the || operator. Is there any solution to combine different text search configurations and use the suitable language when a user searches for a text?


Solution

  • Maybe you think that || is an “or” operator, but it concatenates text search vectors.

    Take a look at what happens in your expression.

    Running \dF+ french in psql will show you that for asciiwords, the French Snowball stemmer is used. That removes stop words and reduces words to their stem. The same goes for English and Russian.

    You can use ts_debug to see this in operation:

    test=> SELECT * FROM ts_debug('english', 'adam and smith');
       alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
    -----------+-----------------+-------+----------------+--------------+---------
     asciiword | Word, all ASCII | adam  | {english_stem} | english_stem | {adam}
     blank     | Space symbols   |       | {}             |              | 
     asciiword | Word, all ASCII | and   | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              | 
     asciiword | Word, all ASCII | smith | {english_stem} | english_stem | {smith}
    (5 rows)
    
    test=> SELECT * FROM ts_debug('french', 'adam and smith');
       alias   |   description   | token | dictionaries  | dictionary  | lexemes 
    -----------+-----------------+-------+---------------+-------------+---------
     asciiword | Word, all ASCII | adam  | {french_stem} | french_stem | {adam}
     blank     | Space symbols   |       | {}            |             | 
     asciiword | Word, all ASCII | and   | {french_stem} | french_stem | {and}
     blank     | Space symbols   |       | {}            |             | 
     asciiword | Word, all ASCII | smith | {french_stem} | french_stem | {smith}
    (5 rows)
    

    Now if you concatenate these four tsvectors, you end up with adam in position 1, 4, 7 and 10.
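    To see the shifting in isolation, here is a minimal two-vector version; the positions follow directly from the question's own output above:

    ```sql
    -- Concatenating two tsvectors appends the second one, shifting its
    -- lexeme positions past the end of the first:
    SELECT to_tsvector('english', 'adam and smith')
        || to_tsvector('french', 'adam and smith');
    -- 'adam':1,4 'and':5 'smith':3,6
    -- ('and' survives at position 5 because it is not a French stop word,
    -- even though the English stemmer discarded it)
    ```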

    There is no good way to use full text search for different languages at once.

    But if it is really personal names you are searching, I would do the following:

    Create a text search configuration with a simple dictionary for asciiwords, and either use an empty stopword file for the dictionary or one that contains stopwords that are acceptable in all languages.

    Personal names normally should not be stemmed, so you avoid that problem. And if you miss a stopword, that's no big deal. It only makes the resulting tsvector (and index) larger, but with personal names there should not be too many stopwords anyway.
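    A sketch of such a setup; the dictionary and configuration names here are made up, and it assumes a stopword file names.stop (which may simply be empty) placed in the tsearch_data directory under the server's share directory:

    ```sql
    -- Hypothetical dictionary based on the "simple" template: it lower-cases
    -- tokens but does not stem them. STOPWORDS = names refers to the file
    -- names.stop in $SHAREDIR/tsearch_data (an empty file works).
    CREATE TEXT SEARCH DICTIONARY names_simple (
        TEMPLATE = simple,
        STOPWORDS = names
    );

    -- Start from the built-in "simple" configuration and map word tokens
    -- to the new dictionary.
    CREATE TEXT SEARCH CONFIGURATION names (COPY = simple);

    ALTER TEXT SEARCH CONFIGURATION names
        ALTER MAPPING FOR asciiword, word
        WITH names_simple;
    ```

    Queries would then name this configuration explicitly, e.g. to_tsvector('names', name) @@ to_tsquery('names', 'smith'), so the result is language-neutral.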