Search code examples
postgresqloptimizationcachingtreegist

Speeding up initial ltree, GiST-indexed queries


Speeding up initial ltree, GiST-indexed queries in Postgres

Setup

I'm experimenting with ltree and pattern searching. It's pretty great, but I'm finding that first runs on searches are very slow compared to later runs. I don't know if this is a general issue around loading and caching index pages, or something more specific to GiST indexes.

For background, I'm trying a few different ideas out in a Materialized View with the following column list:

+-------------------------------------------------+-----------+
| column_name                                     | data_type |
+-------------------------------------------------+-----------+
| edge_sequence                                   | uuid      |
| total_seconds                                   | bigint    |
| source_node_count                               | bigint    |
| node_count                                      | integer   |
| nodes_without_adjacent_duplicates_count         | integer   |
| distinct_states_count                           | integer   |
| has_duplicate_adjacent_events                   | boolean   |
| has_circular_links                              | boolean   |
| state_names                                     | citext[]  |
| state_names_ltree                               | ltree     |
| state_names_ltree_text                          | text      |
| state_names_without_duplicate_adjacent_elements | citext[]  |
| distinct_state_names_ltree                      | ltree     |
| state_symbols                                   | text[]    |
| state_symbols_ltree                             | ltree     |
| state_symbols_cstring                           | text      |
+-------------------------------------------------+-----------+

Here's the GiST index definition for the state_names_ltree field:

DROP INDEX IF EXISTS edge_cycle_state_names_ltree_gist;
CREATE INDEX edge_cycle_state_names_ltree_gist
          ON analytics.edge_cycle
       USING GIST (state_names_ltree 
                   gist_ltree_ops(siglen=100) -- siglen defaults to 8, can be set as high as 2024, in 4-byte increments
                  )
        WITH (fillfactor = 100) -- The data will be in a Materialized View, or TRUNCATE-and-rebuild table.
       WHERE node_count <= 24;  -- Index entries are too long for very long chains (hundreds+)

Query and Plans

This query uses the index to find 581 rows out of the 7,745,994 in the data set:

SELECT edge_sequence,state_names_ltree 
  FROM edge_cycle 
 WHERE state_names_ltree ~ 'WaitingAfterSterilizer.*.Decon.*.Sterilizer|VerifyingAfterSterilizer'::lquery
   AND node_count <= 24;  -- IMPORTANT, the index we want the planner to pick includes this condition.

Here's the plan on a cold cache:

Index Scan using edge_cycle_state_names_ltree_gist on edge_cycle  (cost=0.41..186.90 rows=159 width=180) (actual time=184.459..126255.450 rows=581 loops=1)
  Index Cond: (state_names_ltree ~ 'WaitingAfterSterilizer.*.Decon.*.Sterilizer|VerifyingAfterSterilizer'::lquery)
  Buffers: shared hit=21 read=215134
  I/O Timings: shared read=124327.468
Planning:
  Buffers: shared hit=217 read=14
  I/O Timings: shared read=10.586
Planning Time: 11.474 ms
Execution Time: 126255.839 ms

Here's the plan on an early run, but not on a completely cold cache, where I cancelled the search before it finished. So, a luke-warm cache:

Index Scan using edge_cycle_state_names_ltree_gist on edge_cycle  (cost=0.41..186.90 rows=159 width=180) (actual time=1.608..62277.936 rows=581 loops=1)
  Index Cond: (state_names_ltree ~ 'WaitingAfterSterilizer.*.Decon.*.Sterilizer|VerifyingAfterSterilizer'::lquery)
  Buffers: shared hit=102691 read=112443
  I/O Timings: shared read=60442.362
Planning:
  Buffers: shared hit=1
Planning Time: 0.246 ms
Execution Time: 62278.214 ms

And now the plan when running the same query again immediately after letting a slow-and-load query finish:

Index Scan using edge_cycle_state_names_ltree_gist on edge_cycle  (cost=0.41..186.90 rows=159 width=180) (actual time=1.501..1505.078 rows=581 loops=1)
  Index Cond: (state_names_ltree ~ 'WaitingAfterSterilizer.*.Decon.*.Sterilizer|VerifyingAfterSterilizer'::lquery)
  Buffers: shared hit=215134
Planning:
  Buffers: shared hit=1
Planning Time: 0.232 ms
Execution Time: 1505.176 ms

At the extremes, the query speed is around 84x different. This difference seems to be in the read part of the buffers information:

  Buffers: shared hit=21 read=215134
  I/O Timings: shared read=124327.468

  Buffers: shared hit=102691 read=112443
  I/O Timings: shared read=60442.362

  Buffers: shared hit=215134

So, it looks like the speed hit is coming from loading the index pages.

Speeding Things Up

I've had a look around for what settings I might use to speed up the initial search, but haven't found anything promising yet. I'm not at all well versed in how to tune the Postgres cache. Either on the hardware or software configuration side. It's easy to find mentions to GiST indexes being slower than other index types but, in this case, it's plenty fast enough. Once the pages are loaded. I've been using a pg_trgm GiST index for years, and it is fantastic.

Hardware, Version and Settings

We're running 16.4 on an RDS db.m5.2xlarge instance, which has a vCPU count of 8, and RAM of 32 GB. Here are the cache-related settings I found:

select name,setting,unit from pg_settings where name like '%cache%' order by 1;

/*
+----------------------+---------+------+
| name                 | setting | unit |
+----------------------+---------+------+
| debug_discard_caches | 0       | NULL |
| effective_cache_size | 2015397 | 8kB  |
| plan_cache_mode      | auto    | NULL |
+----------------------+---------+------+
*/

effective_cache_size is the only setting here that looks relevant. And, unless I'm messing up my arithmetic, that's a pretty large cache out of 32GB?

I've tried changing FILLFACTOR to 100%

While writing this up, I considered the index FILLFACTOR. The default for B-trees is 90%, I don't know what the default is for GiST indexes. This data will be in a Materialized View, or wipe-and-replace table. meaning it doesn't have to contend with updates. A GiST, without an explicit override, gets you NULL in reloptions:

SELECT pg_class.relname      AS table_name, 
       pg_class.reloptions
       
  FROM pg_class 
  
  JOIN pg_namespace  
    ON pg_namespace.oid = pg_class.relnamespace

WHERE pg_class.relname = 'edge_cycle_state_names_ltree_gist';

I ran an experiment, and compared the size of the index with the default fill, and then a 100% fill. It looks like GiST is also probably at a 90% FILLFACTOR, at least when using gist_ltree_ops:

select pg_size_pretty(pg_relation_size('edge_cycle_state_names_ltree_gist')); -- Default fill: 2,908 MB
select pg_size_pretty(pg_relation_size('edge_cycle_state_names_ltree_gist')); -- 100% fill:    2,601 MB

select 2601/2908::real; -- 0.894

Other Ideas I've Considered

Don't Search Against a GiST Index

ltree requires a GiST index, if you want indexed pattern searching. But here's what I get when I bypass the index by not including the node_count <= 24 condition in the query:

Gather  (cost=1000.00..1092064.13 rows=159 width=180) (actual time=7.350..2187.419 rows=613 loops=1)
  Workers Planned: 4
  Workers Launched: 4
  ->  Parallel Seq Scan on edge_cycle  (cost=0.00..1091048.23 rows=40 width=180) (actual time=15.969..2180.520 rows=123 loops=5)
        Filter: (state_names_ltree ~ 'WaitingAfterSterilizer.*.Decon.*.Sterilizer|VerifyingAfterSterilizer'::lquery)
        Rows Removed by Filter: 1549076
Planning Time: 0.348 ms
Execution Time: 2187.555 ms

That's pretty decent. Slower than a fast GiST query, faster than a slow initial GiST query, more consistent to naive user. (Where "naive" = "any normal person who is not going to get into the weeds.")

Experimenting with different siglen values.

The range is 2-2024, in 4 byte increments, with a default of 8. I set it to 100. I came up with that number by throwing a dart at a board. Longer siglen means larger index nodes, means more index pages, means slower initial load. Or so it seems. I can explore with a shorter siglen. Any real-world suggestions on how to tune the siglen would save me lot of trouble.

Reducing the Node Size with Symbols

In theory, an ltree path can store up to 65,535 labels. In reality, it blows up when the full label path is somehwere under 12,000 bytes. Our raw paths are relatively short (I'm only including paths with lengths of 24 nodes or less, usually 7 or less), but the labels are relatively long. There are only around 60 distinct labels, so I can "symbolize" them as 2-char hex values. For example:

WaitingAfterSterilizer.Decon.Sterilizer

AO.01.04

The symbolized version is completely unreadable, but makes the nodes smaller. Meaning, more nodes in a page, less time loading pages. I've compared savings on index size recently, and the index size correlates with the path links, but not proportionally. I guess there's some kind of overhead/price of admission, no matter what the data. After that, the data itself does make a difference. So, in my cases, I can save 30-60% of the index by converting the labels to shortened versions.

( haven't explored the symbolization yet much for this loading problem, but smaller is smaller. I set up the symbolization system to simplify experimenting with NetworkX in Python. Also, to run some text similarity experiments in Postgres. (Levenshtein does surprisingly well on the symbolized paths, considering. Trigrams work poorly, which is what you would expect.) But, for ltree path searches, I'd need to write an interface to translate paths into symbolized paths and....ugh. Even for me, that seems over-the-top. I guess that I could write a set-returning function that takes an ltree, parses it out, substitutes labels with their symbols, and runs the query. Pure SQL for this function should be possible. It could return all column, as the planner seems quite smart about pruning out columns that aren't included in the outer query.

I've wandered around a bit here, but I'm unclear which of the various details and ideas included may be relevant. I'll be grateful for any suggestions or warnings.


Solution

  • I posted this question a few weeks back, and then was on the road. During that time, I dug into a stack of the foundational papers behind GiST, posts on-line, and more. I've listed the best bits at the end of this answer.

    Note that I'm putting things together here as a normal person. So, plenty of conjecture. Folks coding Postgres in C, and working with the index APIs could definitely improve and correct my summary.

    Summary

    It turns out that the index configuration, data distribution, search pattern and cache conditions interact in predictable, if not obvious ways. For example, a longer signature length (siglen) makes some queries slower, and a very broad query may be much slower using a GiST index on a cold cache than a parallelized sequential scan under identical conditions. On the other hand, for more selective searches, the GiST index is magically fast.

    Take Aways

    Tuning comes down to matching your index to your query patterns. This post offers some information to help make informed judgements. Briefly:

    • If your queries are fast, then there's nothing else to think about.

    • Broad (low-selectivity) searches need to load many, most, or all index pages.

    • Narrow (high-selectivity) searches only need a few index pages, and are phenomenal.

    • Anything starting with *. is a broad query pattern.

    • Parallel sequential plans can perform quite well in PG 16.4. ltree has been around for decades, so you may run across commentary before the sequential plans were as good as today.

    • You can, if you need, define your index in such a way that you can control if the planner uses it or not.

    • ltree GiST index keys are probabilistic, not deterministic. Meaning that a secondary consistency check pass is required before returning rows.

    • Roughly speaking a longer siglen improves narrow search speeds (fewer entries to check), and slows down initial broad queries (more index pages to load.)

    Okay, now for the details. I'm taking the time to write this up because I found it interesting, didn't see it all pulled together in one place and, knowing me, I'll forget all about it and then find this answer three years from now when I need it ;-)

    Index Access Methods

    This is where the background reading helped out. GiST isn't an index type, it's an index framework. The actual index type is implemented separately, and manages key index-related events and data structures. WAL logging, contention handling, etc. is managed by core Postgres. If you're not already familiar with the index API architecture in Postgres in broad strokes, it's quite interesting. I didn't read up on this again, and don't have any fresh links to include. The docs start here, and there are plain-language reviews around too.

    https://www.postgresql.org/docs/current/indexam.html

    GiST Trees Are Balanced

    Important detail, GiST (Generalized Search Trees) are balanced trees, like B-trees, R-trees, and so on. So, if you know roughly how B-trees work, you've got a partially correct picture in your head already. If you want a more accurate picture, take a look at the GiST section of PostgreSQL 14 Internals:

    https://postgrespro.com/community/books/internals

    Here's an example taken from page 532 of the book (chapter 26):

    Sample tree index from page 532 of Postgres 14 Internals

    GiST for ltree

    An ltree GiST index is implemented using RD (Russian Doll) trees, with index entries stored as a signature. An RD-tree is an extension of the idea behind an R-tree (Rectangle Tree), familiar to folks who use point, PostGIS, or other geometric data types. Basically, you split up a geographic space into a series of rectangles, and then place points in the most likely box. RD trees take that idea and extend it past data that's naturally sortable in one dimension (alphas, numbers, times and dates, etc.) or mappable to a 2D coordinate space (R-trees.) In other words, data types for things like graphs, videos, images, whatever.

    An RD tree is an index based set predicates. Think of it as an index structure that answers the question "might this subtree contain matches on the query?" the illustration above makes this easier to follow. The answers are maybe or no, they are not yes or no. It's a probabilistic structure, not a deterministic one. Meaning, the entire basis for query optimization is having a query pattern that can definitely exclude pages. After that, the system then needs to recheck the items in each maybe page for consistency with the search pattern.

    The previous paragraph explains why a longer siglen makes some queries slower. Imagine a worst case query that has to scan every page. Longer keys mean more pages, more pages means more I/O to load, and more entries to scan for true matches.

    Note: It appears from the source that the consistency check on the index page entry is determinative, and does not require loading the original row from disk. This is Very Good News. It's also not an assumption that you can make about an RD tree index.

    Low-selectivity pattern: GiST

    With all of the above in in mind, look at this query:

    EXPLAIN (ANALYZE true, COSTS true, BUFFERS true)
    select *
       from edge_cycle 
      where state_names_ltree ~ '*'
        and node_count <= 24;    
            
    /*
    
    Seq Scan on edge_cycle  (cost=0.00..1183076.31 rows=7733558 width=1009) (actual time=1.369..16727.613 rows=7729412 loops=1)
      Filter: ((state_names_ltree ~ '*'::lquery) AND (node_count <= 24))
      Rows Removed by Filter: 16582
      Buffers: shared hit=760 read=1067003
      I/O Timings: shared read=13321.146
    Planning:
      Buffers: shared hit=206 read=28
      I/O Timings: shared read=20.079
    Planning Time: 22.023 ms
    Execution Time: 17079.720 ms
    
    */
    

    As you can see, we're not on a cold cache, but there was still a lot of I/O involved in loading the index:

      Buffers: shared hit=760 read=1067003
    

    What's happening? The query pattern does not allow any page pruning. Because *. potentially matches everything. The whole index needs to be traversed to check each entry against the query pattern.

    Low-selectivity pattern: Sequential

    EXPLAIN (ANALYZE true, COSTS true, BUFFERS true)
    select *
       from edge_cycle 
      where state_names_ltree ~ '*.WaitingAfterSterilizer.*';
         -- No node_count condition, index is ignored.d node_count <= 24;    
    
    /*
    Seq Scan on edge_cycle  (cost=0.00..1163703.93 rows=7660201 width=1009) (actual time=0.016..5732.974 rows=7666023 loops=1)
      Filter: (state_names_ltree ~ '*.WaitingAfterSterilizer.*'::lquery)
      Rows Removed by Filter: 79971
      Buffers: shared hit=953 read=1066810
      I/O Timings: shared read=2341.925
    Planning:
      Buffers: shared hit=234
    Planning Time: 0.625 ms
    Execution Time: 6072.919 ms
    *
    

    Note that the numbers above are from an RDS deployment where, honestly, performance is a bit confounding on any given day. However, I've seen similar patterns testing locally. Likewise, the exact plan varies depending on the condition of statistics, etc. I regularly see parallel plans generated for sequential scans, and they perform quite well under PG 16.4. The table is currently a materialized view (just testing), with a bit under 8M rows.

    Skipping the index

    In my case, skipping the index is easy because I've made it a partial index based on node_count:

    CREATE INDEX edge_cycle_state_names_ltree_gist
              ON analytics.edge_cycle
           USING GIST (state_names_ltree 
                       gist_ltree_ops(siglen=100) -- siglen defaults to 8, can be set as high as 2024, in 4-byte increments
                      )
            WITH (fillfactor = 100) -- The data will be in a Materialized View, or TRUNCATE-and-rebuild table.
           WHERE node_count <= 24;  -- This covers around 99.975% of our rows, the others are....suspect.
    

    While not explicitly hinting the planner (which would be Unclean), including and excluding this clause on the WHERE implicitly guides the planner to consider or ignore the index:

        and node_count <= 24;    
    

    For More Information

    If anyone feels like going full-nerd, the most useful additional resources I found are listed below:

    The RD-Tree: An Index Structure for Sets https://dsf.berkeley.edu/papers/UW-CS-TR-1252.pdf

    The README for the GiST sections of Postgres: https://github.com/postgres/postgres/blob/master/src/backend/access/gist/README

    The GiST section of the PostgreSQL 14 Internals book (free PDF available)

    https://postgrespro.com/community/books/internals

    Part V, on indexing, is a contemporary rewrite from a series on Postgres indexes that's shown up in various places down the decades.

    Looking Inside Postgres at a GiST Index, part of a great series on ltree https://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index