postgresql, query-performance, case-insensitive, b-tree-index

How is PostgreSQL citext stored in a b-tree index? Lowercased or as-is?


I am using citext in PostgreSQL for all of my text columns, and I am wondering about its performance.

I ran simple benchmarks with WHERE clauses on columns that have a b-tree index, but I couldn't see any difference in query cost.

For example:

SELECT * FROM table_text WHERE a = '1';

SELECT * FROM table_citext WHERE a = '1';

These queries have identical query costs.
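
The estimated costs can be compared with EXPLAIN (a sketch, assuming a b-tree index on a in both tables):

EXPLAIN SELECT * FROM table_text WHERE a = '1';

EXPLAIN SELECT * FROM table_citext WHERE a = '1';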

As far as I understand, citext stores the string as-is, without converting it to lower case. So when a value is compared in the WHERE clause, PostgreSQL has to apply lower() for every comparison at each node of the b-tree index (the indexes I use are b-trees).

If that is the case, it should cause a noticeable performance penalty, but it doesn't.

How does PostgreSQL achieve this?
How does PostgreSQL store citext column values in a b-tree index?


Solution

  • citext is stored exactly as it is entered, without any conversion to lower case. The same holds for the keys stored in a b-tree index.
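
    As a quick sketch of that behavior (the table here is just illustrative):

    CREATE TABLE casing_demo(t citext NOT NULL);
    INSERT INTO casing_demo VALUES ('HeLLo World');
    SELECT t FROM casing_demo;  -- returns 'HeLLo World', exactly as entered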

    The magic happens in the comparison function for citext:

    /*
     * citextcmp()
     * Internal comparison function for citext strings.
     * Returns int32 negative, zero, or positive.
     */
    static int32
    citextcmp(text *left, text *right, Oid collid)
    {
        char       *lcstr,
                   *rcstr;
        int32       result;
    
        /*
         * We must do our str_tolower calls with DEFAULT_COLLATION_OID, not the
         * input collation as you might expect.  This is so that the behavior of
         * citext's equality and hashing functions is not collation-dependent.  We
         * should change this once the core infrastructure is able to cope with
         * collation-dependent equality and hashing functions.
         */
    
        lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
        rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
    
        result = varstr_cmp(lcstr, strlen(lcstr),
                            rcstr, strlen(rcstr),
                            collid);
    
        pfree(lcstr);
        pfree(rcstr);
    
        return result;
    }
    

    So yes, this should incur some overhead. How expensive it is will also depend on the default collation of the database.
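
    At the SQL level, this comparison function is what makes citext comparisons case-insensitive (a quick sketch, assuming the citext extension is installed):

    SELECT TEXT 'MaMa' = TEXT 'mama'     AS text_eq,
           CITEXT 'MaMa' = CITEXT 'mama' AS citext_eq;
    -- text_eq is false, citext_eq is true, because citextcmp
    -- lowercases both sides before comparing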

    I'll demonstrate this overhead with a query that does not use an index. I am using the German collation:

    SHOW lc_collate;
     lc_collate 
    ------------
     de_DE.utf8
    (1 row)
    

    First using text:

    CREATE TABLE large_text(t text NOT NULL);
    
    INSERT INTO large_text
       SELECT i||'text'
       FROM generate_series(1, 1000000) AS i;
    
    VACUUM (FREEZE, ANALYZE) large_text;
    
    \timing on
    
    SELECT * FROM large_text WHERE t = TEXT 'mama';
     t 
    ---
    (0 rows)
    
    Time: 79.862 ms
    

    Now the same experiment with citext:

    CREATE TABLE large_citext(t citext NOT NULL);
    
    INSERT INTO large_citext
       SELECT i||'text'
       FROM generate_series(1, 1000000) AS i;
    
    VACUUM (FREEZE, ANALYZE) large_citext;
    
    \timing on
    
    SELECT * FROM large_citext WHERE t = CITEXT 'mama';
     t 
    ---
    (0 rows)
    
    Time: 567.739 ms
    

    So citext is about seven times slower.

    But don't forget that each of these experiments performed a sequential scan with a million comparisons.
    If you use an index, the difference will not be noticeable:

    CREATE INDEX ON large_text (t);
    
    Time: 5443.993 ms (00:05.444)
    
    SELECT * FROM large_text WHERE t = TEXT 'mama';
     t 
    ---
    (0 rows)
    
    Time: 1.867 ms
    
    
    CREATE INDEX ON large_citext (t);
    
    Time: 28009.904 ms (00:28.010)
    
    SELECT * FROM large_citext WHERE t = CITEXT 'mama';
     t 
    ---
    (0 rows)
    
    Time: 1.988 ms
    

    You can see that CREATE INDEX takes much longer for the citext column (it has to perform a lot of expensive comparisons), but the queries take about the same time.

    The reason is that you need only a few comparisons with an index scan: for each of the two or three index blocks you access, you perform a binary search, and you may have to re-check the table row you find in the case of a bitmap index scan.
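
    You can check this with EXPLAIN (a sketch; the exact plan and buffer counts depend on your data and PostgreSQL version):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM large_citext WHERE t = CITEXT 'mama';
    -- expect an index scan that reads only a handful of index buffers,
    -- which is why the query stays fast despite the expensive comparisons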