Search code examples
databaseperformancepostgresqlindexingquery-optimization

Which of the two PostgreSQL indexes is more efficient?


I have the following PostgreSQL schema:

CREATE TABLE User (
    ID INTEGER PRIMARY KEY
);

CREATE TABLE BOX (
    ID INTEGER PRIMARY KEY 
);

CREATE SEQUENCE seq_item;

CREATE TABLE Item (
    ID INTEGER PRIMARY KEY DEFAULT nextval('seq_item'),
    SENDER INTEGER REFERENCES User(id),
    RECEIVER INTEGER REFERENCES User(id),
    INFO TEXT,
    BOX_ID INTEGER REFERENCES Box(id) NOT NULL,
    ARRIVAL TIMESTAMP
);

Its main use case is a typical producer/consumer scenario. Different users may insert an item in the database in a particular box for a particular user and each user can retrieve the topmost(this means the oldest) item in a box that is addressed to her/him. It more or less mimics the functionality of a queue on a database level.

More precisely, the most common operations are the following:

INSERT INTO ITEM(SENDER, RECEIVER, INFO, BOX_ID, ARRIVAL) 
VALUES (nsid, nrid, ncontent, nqid, ntime);

And retrieve commands based on a combination of either RECEIVER+SENDER or RECEIVER+BOX_ID:

SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND 
(i.BOX_ID=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;

and

SELECT * INTO it FROM Item i WHERE (i.RECEIVER=? OR i.RECEIVER is NULL) AND 
(i.SENDER=?) ORDER BY ARRIVAL LIMIT 1;
DELETE FROM Item i WHERE i.id=it.id;

The last two snippets are packed within a stored procedure.

I considered using two different indexes.

1. CREATE INDEX ind ON item(arrival);. The EXPLAIN plan for the aforementioned SELECT is the following:

Limit  (cost=0.29..2.07 rows=1 width=35)
  ->  Index Scan using ind on item i  (cost=0.29..3010.81 rows=1693 width=35)
        Filter: (((receiver = 2) OR (receiver IS NULL)) AND (sender = 2))

As far as I understand, the advantage of this approach is that I avoid sorting the data. However, as far as I understand, I still need to scan the whole table, but the access will be random and this will slow down the execution. What I am not sure is if the execution will stop as soon as a match is found because of LIMIT 1 or it will always scan the whole table.

2. CREATE INDEX ind ON item(receiver, sender); The EXPLAIN:

Limit  (cost=512.23..512.23 rows=1 width=35)
  ->  Sort  (cost=512.23..516.46 rows=1693 width=35)
        Sort Key: arrival
        ->  Bitmap Heap Scan on message m  (cost=42.37..503.76 rows=1693 width=35)
              Recheck Cond: (((receiver = 2) AND (sender = 2)) OR ((receiver IS NULL) AND (sender = 2)))
              ->  BitmapOr  (cost=42.37..42.37 rows=1693 width=0)
                    ->  Bitmap Index Scan on ind  (cost=0.00..37.22 rows=1693 width=0)
                          Index Cond: ((receiver = 2) AND (sender = 2))
                    ->  Bitmap Index Scan on ind  (cost=0.00..4.30 rows=1 width=0)
                          Index Cond: ((receiver IS NULL) AND (sender = 2))

In this scenario I can efficiently find the matches for a receiver and sender, but I need to sort the result afterwards, which may be slow.

So which of the two options is better and why? The estimated cost for the first is way lower, but the second one seems to be more "deterministic".


Solution

  • For this query:

    SELECT * INTO it
    FROM Item i
    WHERE (i.RECEIVER = ? OR i.RECEIVER is NULL) AND 
          (i.SENDER = ?)
    ORDER BY ARRIVAL
    LIMIT 1;
    

    The best index is likely to be item(sender, arrival, receiver), in that order. This will filter by sender, then use the index for ordering, and filter again by receiver.

    The fastest way to go may be:

    select *
    from ((select i.*
           from item i
           where receiver = ? and sender = ?
           order by arrival
           limit 1
          ) union all
          (select i.*
           from item i
           where receiver is null and sender = ?
           order by arrival
           limit 1
          ) 
         ) i
    order by arrival
    limit 1;
    

    The best index for this version is item(sender, receiver, arrival). It will use the index to fetch (at most) one row in each subquery. The final sort (on two rows) is negligible.

    Of course, the same reasoning applies to the other query as well.