I'm seeing strange output from pgbench, and I don't know what other tools there are to help me understand the reasons for the disproportionately large latencies. Here are the results of explain analyze run against the populated pgbench database:
pgbench=# explain analyze UPDATE pgbench_tellers SET tbalance = tbalance + 12345 WHERE tid = 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Update on pgbench_tellers (cost=0.00..2.25 rows=1 width=358) (actual time=0.077..0.077 rows=0 loops=1)
-> Seq Scan on pgbench_tellers (cost=0.00..2.25 rows=1 width=358) (actual time=0.023..0.025 rows=1 loops=1)
Filter: (tid = 100)
Rows Removed by Filter: 99
Total runtime: 0.092 ms
(5 rows)
pgbench=# explain analyze UPDATE pgbench_branches SET bbalance = bbalance + 12345 WHERE bid = 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Update on pgbench_branches (cost=0.00..1.13 rows=1 width=370) (actual time=0.045..0.045 rows=0 loops=1)
-> Seq Scan on pgbench_branches (cost=0.00..1.13 rows=1 width=370) (actual time=0.027..0.028 rows=1 loops=1)
Filter: (bid = 10)
Rows Removed by Filter: 9
Total runtime: 0.060 ms
(5 rows)
So, according to the planner, both queries should be in the same ballpark, with the second query perhaps slightly faster because it has fewer rows to scan.
The reality, however, is this:
tps = 695.349735 (including connections establishing)
tps = 695.355481 (excluding connections establishing)
statement latencies in milliseconds:
0.002471 \set nbranches 1 * :scale
0.000743 \set ntellers 10 * :scale
0.000684 \set naccounts 100000 * :scale
0.000786 \setrandom aid 1 :naccounts
0.000585 \setrandom bid 1 :nbranches
0.000609 \setrandom tid 1 :ntellers
0.000583 \setrandom delta -5000 5000
0.124286 BEGIN;
0.225984 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.167551 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.570490 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
3.230071 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.134970 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
7.037875 END;
In other words, the query that explain analyze suggested would be about 30% faster (0.060 ms vs. 0.092 ms) in reality ran over 50 times slower than that estimate, and more than five times slower than the tellers update.
My guess so far is that the PostgreSQL planner has defaults appropriate for low-latency storage, whereas these tests run against an iSCSI device with potentially much higher latency (but also a high queue depth and high bandwidth).
So, I would like to find out whether there is a way to produce a more detailed report of the I/O operations PostgreSQL performed during the test, from its own perspective (I can already see the iSCSI portal side). And, if possible, a way to configure the planner to adjust its estimates for the "typical cloud" situation, where latencies are high, but so are queue depths and bandwidth.
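For illustration, the kinds of knobs involved might look like the following; this is only a sketch, assuming PostgreSQL 9.2 or newer, and the values are placeholders rather than recommendations:

SET track_io_timing = on;           -- superuser only; adds read/write timings to EXPLAIN and pg_stat_database
SET random_page_cost = 1.1;         -- tell the planner random reads cost almost the same as sequential ones
SET effective_io_concurrency = 32;  -- hint that the device can service many requests in parallel

EXPLAIN (ANALYZE, BUFFERS)
UPDATE pgbench_branches SET bbalance = bbalance + 12345 WHERE bid = 10;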
After enabling auto_explain, as well as a few other logging configuration changes, below is the relevant output:
LOG: duration: 2667.835 ms plan:
Query Text: UPDATE pgbench_branches SET bbalance = bbalance + 3781 WHERE bid = 9;
Update on pgbench_branches (cost=4.26..8.27 rows=1 width=370) (actual time=2667.834..2667.834 rows=0 loops=1)
Buffers: shared hit=11
-> Bitmap Heap Scan on pgbench_branches (cost=4.26..8.27 rows=1 width=370) (actual time=0.010..0.010 rows=1 loops=1)
Recheck Cond: (bid = 9)
Buffers: shared hit=3
-> Bitmap Index Scan on pgbench_branches_pkey (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (bid = 9)
Buffers: shared hit=2
LOG: duration: 9.604 ms plan:
Query Text: UPDATE pgbench_tellers SET tbalance = tbalance + 3845 WHERE tid = 33;
Update on pgbench_tellers (cost=0.00..8.28 rows=1 width=358) (actual time=9.604..9.604 rows=0 loops=1)
Buffers: shared hit=7
-> Index Scan using pgbench_tellers_pkey on pgbench_tellers (cost=0.00..8.28 rows=1 width=358) (actual time=0.009..0.010 rows=1 loops=1)
Index Cond: (tid = 33)
Buffers: shared hit=3
So, the difference seems to be Bitmap Heap Scan + Bitmap Index Scan vs. Index Scan. I'm still unsure why the plans differ and what that means in terms of I/O.
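One way to compare the two access paths directly would be to disable bitmap scans for a single transaction and re-run the statement; a sketch (enable_bitmapscan is a planner test switch, not something to leave on in production):

BEGIN;
SET LOCAL enable_bitmapscan = off;   -- steer the planner away from the bitmap scan for this transaction only
EXPLAIN (ANALYZE, BUFFERS)
UPDATE pgbench_branches SET bbalance = bbalance + 12345 WHERE bid = 10;
ROLLBACK;                            -- EXPLAIN ANALYZE really executes the UPDATE, so discard it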
From more reading about the kinds of scans and the rationale behind them, it seems that PostgreSQL chooses an Index Scan for the bigger table because it believes most rows will be irrelevant to the query, while it uses a Bitmap Heap Scan on the small table because it believes the result set is small enough that it's easier to scan "everything" and then filter out the unnecessary rows (of which there won't be many). However, due to the huge latency impact, the bitmap heap scan turns out to be the wrong strategy.
My previous interpretation was wrong, or at least not the full answer. After adding I/O time reporting and lock wait reporting, it appears that the time PostgreSQL spends waiting to acquire locks is responsible for the delays. Now the bigger question is why the contention on pgbench_branches is so much higher, or whether there is some other mechanism at play that makes these lock waits so terribly slow (sometimes > 3 seconds).
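For anyone trying to reproduce this, the waiters and the sessions blocking them can be listed with a query along these lines against pg_locks and pg_stat_activity; a sketch, assuming 9.2 or newer (the waiter/holder matching is simplified):

-- sessions waiting for a lock, paired with sessions holding a conflicting lock on the same object
SELECT w.pid     AS waiting_pid,
       wa.query  AS waiting_query,
       h.pid     AS holding_pid,
       ha.query  AS holding_query
  FROM pg_locks w
  JOIN pg_locks h
    ON h.locktype = w.locktype
   AND h.database      IS NOT DISTINCT FROM w.database
   AND h.relation      IS NOT DISTINCT FROM w.relation
   AND h.transactionid IS NOT DISTINCT FROM w.transactionid
   AND h.pid <> w.pid
  JOIN pg_stat_activity wa ON wa.pid = w.pid
  JOIN pg_stat_activity ha ON ha.pid = h.pid
 WHERE NOT w.granted
   AND h.granted;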
Unless you are running something else besides pgbench on the database, nothing would lock you for two seconds.
Also, the bitmap heap scan is not at fault (it is fast); it is the update itself that is slow.
The most likely cause is I/O overload – check the I/O wait time spent by the CPU.
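If track_io_timing is enabled, the time the server itself spends on data-file reads and writes is also visible in pg_stat_database, which is a useful cross-check against the OS-level I/O wait figures; a sketch (9.2 or newer):

SELECT datname, blk_read_time, blk_write_time   -- cumulative milliseconds spent reading/writing data files
  FROM pg_stat_database
 WHERE datname = 'pgbench';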
Another (unlikely) possibility, perhaps in combination with the previous one, would be a tiny shared_buffers, so that the backend cannot find a clean buffer.
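To check that, something along these lines would show the current setting and how often backends had to write out dirty buffers themselves because no clean buffer was available; a sketch (the pg_stat_bgwriter counters are cumulative since the last statistics reset):

SHOW shared_buffers;

SELECT buffers_backend,         -- buffers written directly by backends (no clean buffer was found)
       buffers_backend_fsync,   -- cases where the backend even had to do the fsync itself
       buffers_alloc            -- total buffers allocated
  FROM pg_stat_bgwriter;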