Efficiently indexing sparse nullable column in Postgres

I need to index a Postgres column that consists of mostly NULL values. I don't want the NULL values to be stored in the index (to make the index smaller, and to speed up row insertion). However, adding a partial index on IS NOT NULL only lets me search efficiently for all non-null values, not for a specific non-null value.

How can I define an index on the value in this column that excludes only NULL values from the index?

Solution

partial index on IS NOT NULL only lets me search efficiently for all non-null values, not for a specific non-null value.

That sounds like one of these:

create index on tbl(col is not null);
create index on tbl(col is not null, col);
create index on tbl(col, col is not null);

But none of the above is a partial index. That would be

create index on tbl(col)where(col is not null);

And it does let you "search efficiently for specific non-null values" while being exactly "an index on the value in this column that excludes only NULL values from the index".

_{demo at db<>fiddle}

select setseed(.42);

create table tbl(col,r)as 
  select case when .5<random() then n::int end as col--50% null `col` values
        ,random() as r
  from generate_series(1,7e5)n 
  where .1<random()
  order by random();

create index idx1_normal_no_deduplication on tbl(col)
  with(deduplicate_items=false);
create index idx2_partial_no_deduplication on tbl(col)
  with(deduplicate_items=false)
  where(col is not null);
create index idx3_normal_deduplicated on tbl(col)
  with(deduplicate_items=true);
create index idx4_partial_deduplicated on tbl(col)
  with(deduplicate_items=true)
  where(col is not null);

You can check how much smaller the partial index is compared to the normal one:

select indexrelname
     , pg_relation_size(indexrelid)
     , pg_size_pretty(pg_relation_size(indexrelid))
from pg_stat_all_indexes i 
  join pg_class c 
  on i.relid=c.oid 
where i.relname='tbl'
order by 2;

indexrelname	pg_relation_size	pg_size_pretty
idx2_partial_no_deduplication	7118848	6952 kB
idx4_partial_deduplicated	7118848	6952 kB
idx3_normal_deduplicated	9289728	9072 kB
idx1_normal_no_deduplication	14180352	14 MB

Which makes sense considering 50% of col are null, and all other values are unique. It also shows that even with deduplication enabled, the partial index can still have much smaller footprint than the deduplicated, non-partial. With deduplication off, you get 50% smaller index and there's still 25% to be saved even with it on. After all, equal values aren't skipped entirely, they are still there but better compacted.
It goes without saying that with less null values the difference will be less pronounced.

Leaving only the partial, deduplicated variant:

explain analyze verboseselect from tbl where col<8e4;


Index Only Scan using idx4_partial_deduplicated on public.tbl (cost=0.42..9804.20 rows=209863 width=0) (actual time=0.060..90.552 rows=36110 loops=1)

explain analyze verbose select from tbl where col=8e4;


Gather (cost=1000.42..7904.08 rows=3148 width=0) (actual time=81.817..82.158 rows=0 loops=1)
Workers Planned: 3
Workers Launched: 3
-> Parallel Index Only Scan using idx4_partial_deduplicated on public.tbl (cost=0.42..6589.28 rows=1015 width=0) (actual time=50.420..50.421 rows=0 loops=4)

Considering col is null disables index use since those cases are precisely what the index ignores (also, because it might be easier to seq scan if the conditions describe big enough part of the set):

explain analyze verbose select from tbl where col=8e4 or col is null;


Seq Scan on public.tbl (cost=0.00..12539.83 rows=314781 width=0) (actual time=0.014..102.891 rows=314082 loops=1)