Search code examples
sqlpostgresqlpgvector

Converting bytea to int[] in SQL in Postgres


My table in a PostgreSQL database stores a large number of integer arrays in the column called vector. Since all the numbers fit into the range from -128 to 127 and to save space, I use the bytea type. I need to use the dot product distance operator <#> provided by the pgvector extension, e.g.:

select id
from MyTable
order by vector<#>key
limit 10;

However, this operator does not support the bytea type. I want to convert my vectors to int[], which can then be cast to the vector type supported by pgvector.

I think the general direction for converting bytea to int[] might be along the lines of:

SELECT string_to_array(encode(vector, 'escape'), '\\')::int[]
FROM MyTable;

How do I modify this to make the conversion work?


Solution

  • You can join with generate_series() and use the get_byte() function:

    WITH b(b) AS (VALUES (BYTEA '\x55FF1200DEADBEEF'))
    SELECT array_agg(get_byte(b.b, g.i) - 128)
    FROM b
       CROSS JOIN LATERAL generate_series(0, length(b) - 1) AS g(i);
    
                array_agg             
    ══════════════════════════════════
     {-43,127,-110,-128,94,45,62,111}
    (1 row)
    

    This is going to be slow and inefficient, but that is to be expected if you try to save space at all costs.

    I would recommend to use a smallint[], which uses 2 bytes per number.

    If you are running PostgreSQL v15 or later, you could also use an array of the internal data type "char", which takes only one byte per element:

    SELECT ARRAY[-43,127,-110,-128,94,45,62,111]::"char"[];
    
                     array                  
    ════════════════════════════════════════
     {"\\325",\x7F,"\\222","\\200",^,-,>,o}
    (1 row)
    

    To get an individual element, you use

    SELECT (ARRAY[-43,127,-110,-128,94,45,62,111]::"char"[])[3]::integer;
    
     array 
    ═══════
      -110
    (1 row)
    

    A "char"[] uses one byte per element plus an overhead of 21 bytes per array (a bytea has an overhead of one or four bytes per value, depending on the size).