Search code examples
postgresqlfull-text-searchpostgresql-9.6tsvector

Adding multiple phrases together in phraseto_tsquery


I have been successfully able to concat arrays of single words into strings for to_tsquery but phraseto_tsquery in postgres 9.6 only allows one keyword phrase. Does anyone know of a solution to query a tsvector (whether in Sql or full-text-search features) in such a way where I can (OR/AND) a dynamic amount of phrases into a query. The select blocks are all arrays of text.

First tries:

SELECT to_tsvector('english','Try not to become a man of successful companies, but rather try to become a man of value')
   @@ (to_tsquery('english','man & become')
       && phraseto_tsquery('english','man of value')
       && phraseto_tsquery('english','company')
       || phraseto_tsquery('english', 'company | man of value')
   );

Example of the real world problem searching for animals:

-- with statements here of opp_tsv and tp
SELECT
  tp.id,
  tp.keywords, --['giraffes','lions', 'monkeys']
  tp.phrase_keywords, --['pygmy marmocet','African Lion']
  tp.neg_keywords, --['aliens', 'spaceships', 'space']
  tp.neg_phrase_keywords --['Andromedan Alien', 'Nibiru Reptilian']
FROM tp, opp_tsv,
  -- string logic for ts_query
      concat(array_to_string(tp.keywords, ' | ')) AS kws_concat,
      concat(array_to_string(tp.neg_keywords, ' | ')) AS     neg_kws_concat,
      to_tsquery('english', kws_concat) query,
      to_tsquery('english', concat(neg_kws_concat)) neg_query
  -- Case logic for phrase queries

  -- .... -> phrase_query,
      phraseto_tsquery('phrase to search | Need this phrase too')
  -- .... -> phrase_neg_query,

WHERE
  (
    opp_tsv.doc @@ query --pos
    OR
    opp_tsv.doc @@ phrase_query --pos
  )
  AND NOT (
    opp_tsv.doc @@ neg_query --neg
    OR
    opp_tsv.doc @@ phrase_neg_query --neg
  )
ORDER BY rank_cd DESC;

Thoughts: generate dynamically according to array length

opp_tsv.doc @@ (phrase_query || phrase_query2)

or achieve this somehow

opp_tsv.doc @@ phraseto_tsquery('big messy phrase | more messy wordphrases')

EDIT: SELECT phraseto_tsquery('phrase to search | Need this phrase too') result = 'phrase' <-> 'to' <-> 'search' <-> 'need' <-> 'this' <-> 'phrase' <-> 'too' What I am looking for is the result of 'phrase<->to<->search' | 'need<->this<->phrase<->too'


Solution

  • You can define your own aggregate over tsquery's or (||) operator:

    CREATE AGGREGATE tsquery_or_agg(tsquery) (
      SFUNC = tsquery_or,
      STYPE = tsquery
    );
    

    Note: the aggregate above relies on the fact that tsquery's || operator is backed by the tsquery_or(tsquery, tsquery) function. You can check that with:

    SELECT *
    FROM   pg_operator
    WHERE  oprname  = '||'
    AND    oprleft  = regtype 'tsquery'
    AND    oprright = regtype 'tsquery';
    

    If you don't want to rely on this (undocumented) function's name (even if it's unlikely to be changed), you can create your own function to serve as the base function (SFUNC) for your aggregate:

    CREATE FUNCTION my_tsquery_or(tsquery, tsquery)
      RETURNS tsquery
      LANGUAGE sql
      IMMUTABLE
      STRICT
      AS 'SELECT $1 || $2';
    

    After that, your query will be something like:

    WITH tp(id, keywords, phrase_keywords, neg_keywords, neg_phrase_keywords ) AS (
      VALUES (42, ARRAY['giraffes', 'lions', 'monkeys']::text[],
                  ARRAY['pygmy marmocet', 'African Lion']::text[],
                  ARRAY['aliens', 'spaceships', 'space']::text[],
                  ARRAY['Andromedan Alien', 'Nibiru Reptilian']::text[])
    ),
    tq(id, query) AS (
      SELECT   tp.id,
               (((SELECT tsquery_or_agg(plainto_tsquery(kw)) FROM unnest(keywords) kw) ||
                 (SELECT tsquery_or_agg(phraseto_tsquery(pk)) FROM unnest(phrase_keywords) pk)) &&
                 !!((SELECT tsquery_or_agg(plainto_tsquery(nk)) FROM unnest(neg_keywords) nk) ||
                    (SELECT tsquery_or_agg(phraseto_tsquery(np)) FROM unnest(neg_phrase_keywords) np)))
      FROM     tp
    ),
    opp_tsv(doc) AS (
      VALUES (to_tsvector('Earth''s African Lions')),
             (to_tsvector('Andromedan Alien''s space monkeys'))
    )
    SELECT   tp.id,
             tp.keywords,
             tp.phrase_keywords,
             tp.neg_keywords,
             tp.neg_phrase_keywords,
             opp_tsv.doc
    FROM     opp_tsv, tp
    JOIN     tq USING (id)
    WHERE    opp_tsv.doc @@ tq.query
    ORDER BY ts_rank_cd(opp_tsv.doc, tq.query) DESC;
    

    Also, if fields in tp can contain phrases like 'big messy phrase | more messy wordphrases', then you didn't split your input properly in the first place. You can split such phrases/keywords with the regexp_split_to_table() function. With that, the tq CTE should look something like:

    tq(id, query) AS (
      SELECT   tp.id,
               (((SELECT tsquery_or_agg(plainto_tsquery(kw)) FROM unnest(keywords) kwb, regexp_split_to_table(kwb, '\|') kw) ||
                 (SELECT tsquery_or_agg(phraseto_tsquery(pk)) FROM unnest(phrase_keywords) pkb, regexp_split_to_table(pkb, '\|') pk)) &&
                 !!((SELECT tsquery_or_agg(plainto_tsquery(nk)) FROM unnest(neg_keywords) nkb, regexp_split_to_table(nkb, '\|') nk) ||
                    (SELECT tsquery_or_agg(phraseto_tsquery(np)) FROM unnest(neg_phrase_keywords) npb, regexp_split_to_table(npb, '\|') np)))
      FROM     tp
    ),