Search code examples
sqlpostgresqlmd5immutabilitygenerated-columns

How to create a GENERATED column containing the MD5 of multiple columns?


I tried to add the following table in PostgreSQL 14.3:

CREATE TABLE client_cache (
    id            BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    request       VARCHAR COMPRESSION lz4 NOT NULL CHECK (LENGTH (request) <= 10240),
    request_body  BYTEA COMPRESSION lz4 NOT NULL CHECK (LENGTH (request_body) <= 1048576),
    request_hash VARCHAR GENERATED ALWAYS AS (MD5(ROW(request::BYTEA, request_body)::VARCHAR)) STORED
);

But Postgres complains:

[42P17] ERROR: generation expression is not immutable

I've seen many answers discussing how to create a GENERATED column containing the MD5 of a single column, but as soon as you add ROW() to calculate MD5 over multiple columns, the expression is no longer immutable.

I can create a GENERATED column using ROW(MD5(A), MD5(B)) but not MD5(ROW(A, B)).

What can I do instead to create a single MD5 value over multiple columns of varying types (as seen above)?

I know that I can create a view or populate a column using a trigger, but I'd really like to stick to a GENERATED column if possible.


Solution

  • I suggest an immutable helper function:

    CREATE OR REPLACE FUNCTION f_request_md5(_request text, _request_body bytea)
      RETURNS uuid
      LANGUAGE sql IMMUTABLE PARALLEL SAFE AS 
    'SELECT md5(textin(record_out((md5(_request_body), _request))))::uuid';
    

    And a table like this:

    CREATE TABLE client_cache (
      id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
    , request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
    , request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
    , request_hash uuid   GENERATED ALWAYS AS (f_request_md5(request, request_body)) STORED
    );
    

    db<>fiddle here

    Note the more efficient uuid instead of varchar. See:

    Background

    There are two overloaded variants of md5() in Postgres 14 (or any supported version):

    test=> SELECT (proargtypes::regtype[])[0], prorettype::regtype, provolatile
    test-> FROM   pg_proc
    test-> WHERE  proname = 'md5';
     proargtypes | prorettype | provolatile 
    -------------+------------+-------------
     bytea       | text       | i
     text        | text       | i
    (2 rows)
    

    One takes bytea, one text, both are IMMUTABLE and return text. So this expression is immutable:

    ROW(MD5(request), MD5(request_body))
    

    But this is not, like you found out the hard way:

    MD5(ROW(A, B)::varchar)
    

    The text representation of a record is not immutable. There are many reasons. One obvious reason for the case at hand: bytea output can be in (default) hex format or in the obsolescent escape format. A plain

    SET bytea_output = 'escape'; 
    

    ... would break your generated column.

    To get an immutable text representation of a bytea value, you'd run it through encode(request_body, 'hex'). But don't go there. md5(request_body) gives us a faster immutable text "representation" for our purpose.

    We still can't cast a record. So I created the wrapper function. Be sure to read this related answer for more explanation:

    Like discussed in that answer, the new, built-in function hash_record_extended() would be much more efficient for the purpose. So if a bigint is good enough, consider this:

    CREATE TABLE client_cache2 (
      id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
    , request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
    , request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
    , request_hash bigint GENERATED ALWAYS AS (hash_record_extended((request, request_body), 0)) STORED
    );
    

    same db<>fiddle here

    Works out of the box in Postgres 14 or later.

    Related: