Search code examples
postgresqltsvector

Should I store empty tsvector values or NULL values?


When storing a tsvector value in a column, for records with no search terms, should I store an empty tsvector or a NULL value?

Does it matter?

Is there any difference in terms of performance, or storage overhead from storing empty vectors?

In other words, when updating the vector based on a value of, say, a nullable title column, do I need to always compute this as to_tsvector(coalesce(title,'')) (since to_tsvector returns NULL when given a NULL argument) or is it enough to do to_tsvector(title)?


Solution

  • Logic aspects of your question

    First off, the semantics of SQL NULL is that of UNKNOWN, whereas some data types also have an "empty" value. Those data types include:

    • TEXT ('' isn't the same as NULL::TEXT)
    • JSON and JSONB ([] or {} aren't the same as NULL::JSON or NULL:JSONB)
    • X[] (ARRAY[]::X[] isn't the same as NULL::X[])

    There are many more, including TSVECTOR. The semantics of an empty collection of something is always subtly different from that of a NULL value, which is an UNKNOWN collection (often just used as an absent collection, though). The distiction manifests specifically when it comes to using operators, e.g.

    • '' || 'abc' = 'abc' but NULL || 'abc' IS NULL
    • to_tsvector('cats ate rats') @@ to_tsquery('cat & rat') = true but NULL @@ to_tsquery('cat & rat') IS NULL

    In that sense, the decision should be foremost a logic one, not a storage one, based on this question: Will you still work with the TSVECTOR value of a record, even if the record doesn't have any search terms (pro empty TSVECTOR)? Or does the feature not apply at all to that particular record (pro NULL value)? For the @@ operator, it may not be so relevant, but it definitely is for the || operator, and others.

    The answer isn't obvious, nor is there a clear right / wrong way in general.

    Performance aspects of your question

    If this is a highly performance sensitive situation in your application (e.g. you have a lot of empty TSVECTOR values), then maybe, this benchmark could help you with the decision?

    I ran the below benchmark on PostgreSQL 14.1 in Docker to get this result:

    RUN 1, Statement 1: 2.91145
    RUN 1, Statement 2: 1.00000 -- The fastest run is 1. The others are multiples of 1
    RUN 2, Statement 1: 2.80509
    RUN 2, Statement 2: 1.05232
    RUN 3, Statement 1: 2.78001
    RUN 3, Statement 2: 1.00202
    RUN 4, Statement 1: 2.74319
    RUN 4, Statement 2: 1.00524
    RUN 5, Statement 1: 2.75808
    RUN 5, Statement 2: 1.00045
    
    • Statement 1 is SELECT v @@ to_tsquery('cat & rat') with v tsvector = to_tsvector('');
    • Statement 2 is SELECT NULL @@ to_tsquery('cat & rat')

    The fact that NULL is involved probably leads to a shortcut in the @@ operator's algorithm, which produces a 2.7x performance improvement over querying an empty TSVECTOR in the benchmark. So, there do seem to be benefits of using NULL in terms of performance.

    Obviously, that's just a benchmark, which doesn't necessarily reflect real-world use-cases, but it should give you a hint of a potential difference.

    Benchmark code

    For reproduction or adaptations, here's a benchmark, based on this technique.

    DO $$
    DECLARE
      v_ts TIMESTAMP;
      v_repeat CONSTANT INT := 10000;
      rec RECORD;
      run INT[];
      stmt INT[];
      elapsed DECIMAL[];
      min_elapsed DECIMAL;
      i INT := 1;
    
      -- Store the vector in a local variable to avoid re-computing it in the benchmark
      v tsvector = to_tsvector('');
    BEGIN
    
      -- Repeat the whole benchmark several times to avoid warmup penalty
      FOR r IN 1..5 LOOP
        v_ts := clock_timestamp();
    
        FOR i IN 1..v_repeat LOOP
          FOR rec IN (
            -- Statement 1
            SELECT v @@ to_tsquery('cat & rat')
          ) LOOP
            NULL;
          END LOOP;
        END LOOP;
    
        run[i] := r;
        stmt[i] := 1;
        elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP)) 
                     - EXTRACT(EPOCH FROM v_ts));
        i := i + 1;
        v_ts := clock_timestamp();
    
        FOR i IN 1..v_repeat LOOP
          FOR rec IN (
            -- Statement 2
            SELECT NULL @@ to_tsquery('cat & rat') 
          ) LOOP
            NULL;
          END LOOP;
        END LOOP;
    
        run[i] := r;
        stmt[i] := 2;
        elapsed[i] := (EXTRACT(EPOCH FROM CAST(clock_timestamp() AS TIMESTAMP))
                     - EXTRACT(EPOCH FROM v_ts));
        i := i + 1;
      END LOOP;
    
      SELECT min(t.elapsed)
      INTO min_elapsed
      FROM unnest(elapsed) AS t(elapsed);
    
      FOR i IN 1..array_length(run, 1) LOOP
        RAISE INFO 'RUN %, Statement %: %', run[i], stmt[i], 
          CAST(elapsed[i] / min_elapsed AS DECIMAL(10, 5));
      END LOOP;
    END$$;