postgresql amazon-ec2 indexing query-optimization amazon-ebs

Understanding why my Postgres queries are faster without indexes

We've inherited a Postgresql driven datawarehouse with some serious performance issues. We've selected a database for one of our customers and are benchmarking queries. We've picked into the queries and found a couple of common tables which are selected from in all queries, which we believe is at the heart of our poor performance. We're concentrating particularly on cold start performance, when no data is loaded in to the shared buffers as this is common scenario for our customers. We took a large query and stripped it down to its slowest part;

explain (analyze, buffers, costs, format json)
        select
            poi."SKUId" as "SKUId",
            poi."ConvertedLineTotal" as "totalrevenue",
            poi."TotalDispachCost" as "totaldispachcost",
            poi."Quantity" as "quantitysold"
        from public.processedorder o
        join public.processedorder_item poi on poi."OrderId" = o."OrderId"
        WHERE o."ReceivedDate" >= '2020-09-01' and o."ReceivedDate" <= '2021-01-01';

And in fact, stripped this down further to a single table which seems particularly slow;

explain (analyze, buffers, costs, format json)
select o."OrderId", o."ChannelId"
from public.processedorder o
WHERE o."ReceivedDate" >= '2020-09-01' and o."ReceivedDate" <= '2021-01-01'

This processedorders table has around 2.1 million rows with around 35/40k rows being added each month.

The table looks like this- it's fairly wide;

CREATE TABLE public.processedorder (
    "OrderId" int4 NOT NULL,
    "ChannelId" int4 NOT NULL,
    "ShippingId" int4 NOT NULL,
    "CountryId" int4 NOT NULL,
    "LocationId" int4 NOT NULL,
    "PackagingId" int4 NOT NULL,
    "ConvertedTotal" numeric(18, 6) NOT NULL,
    "ConvertedSubtotal" numeric(18, 6) NOT NULL,
    "ConvertedShippingCost" numeric(18, 6) NOT NULL,
    "ConvertedShippingTax" numeric(18, 6) NOT NULL,
    "ConvertedTax" numeric(18, 6) NOT NULL,
    "ConvertedDiscount" numeric(18, 6) NOT NULL,
    "ConversionRate" numeric(18, 6) NOT NULL,
    "Currency" varchar(3) NOT NULL,
    "OriginalTotal" numeric(18, 6) NOT NULL,
    "OriginalSubtotal" numeric(18, 6) NOT NULL,
    "OriginalShippingCost" numeric(18, 6) NOT NULL,
    "OrignalShippingTax" numeric(18, 6) NOT NULL,
    "OriginalTax" numeric(18, 6) NOT NULL,
    "OriginalDiscount" numeric(18, 6) NOT NULL,
    "ReceivedDate" timestamp NOT NULL,
    "DispatchByDate" timestamp NOT NULL,
    "ProcessedDate" timestamp NOT NULL,
    "HoldOrCancel" bool NOT NULL,
    "CustomerHash" varchar(100) NOT NULL,
    "EmailHash" varchar(100) NOT NULL,
    "GetPostalCode" varchar(10) NOT NULL,
    "TagId" uuid NOT NULL,
    "timestamp" timestamp NOT NULL,
    "IsRMA" bool NOT NULL DEFAULT false,
    "ConversionType" int4 NOT NULL DEFAULT 0,
    "ItemWeight" numeric(18, 6) NULL,
    "TotalWeight" numeric(18, 6) NULL,
    "PackageWeight" numeric(18, 6) NULL,
    "PackageCount" int4 NULL,
    CONSTRAINT processedorder_tagid_unique UNIQUE ("TagId")
)
WITH (
    fillfactor=50
);

The confusion we have is, on a local copy of the database we run the smallest query with a simple index on receivedDate and it returns the results in 4 seconds-

create INDEX if not exists ix_processedorder_btree_receieveddate ON public.processedorder USING btree ("ReceivedDate" DESC);

execution plan can be seen here https://explain.tensor.ru/archive/explain/639d403ef7bf772f698502ed98ae3f63:0:2021-12-08#explain

Hash Join  (cost=166953.18..286705.36 rows=201855 width=19) (actual time=3078.441..4176.251 rows=198552 loops=1)
  Hash Cond: (poi."OrderId" = o."OrderId")
  Buffers: shared hit=3 read=160623
  ->  Seq Scan on processedorder_item poi  (cost=0.00..108605.28 rows=2434228 width=23) (actual time=0.158..667.435 rows=2434228 loops=1)
        Buffers: shared read=84263
  ->  Hash  (cost=164773.85..164773.85 rows=174346 width=4) (actual time=3077.990..3077.991 rows=173668 loops=1)
        Buckets: 262144 (originally 262144)  Batches: 1 (originally 1)  Memory Usage: 8154kB
        Buffers: shared read=76360
        ->  Bitmap Heap Scan on processedorder o  (cost=3703.48..164773.85 rows=174346 width=4) (actual time=27.285..3028.721 rows=173668 loops=1)
              Recheck Cond: (("ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND ("ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
              Heap Blocks: exact=75882
              Buffers: shared read=76360
              ->  Bitmap Index Scan on ix_receiveddate  (cost=0.00..3659.89 rows=174346 width=0) (actual time=17.815..17.815 rows=173668 loops=1)
                    Index Cond: (("ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND ("ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
                    Buffers: shared read=478

We then apply this index to our staging server (same copy of the DB) and run the query, but this time it takes 44 seconds; https://explain.tensor.ru/archive/explain/9610c603972ba89aac4e223072f27575:0:2021-12-08

Gather  (cost=112168.21..275565.45 rows=174360 width=19) (actual time=42401.776..44549.996 rows=145082 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=149058 read=50771
  ->  Hash Join  (cost=111168.21..257129.45 rows=72650 width=19) (actual time=42397.903..44518.824 rows=48361 loops=3)
        Hash Cond: (poi."OrderId" = o."OrderId")
        Inner Unique: true
        Buffers: shared hit=445001 read=161024
        ->  Parallel Seq Scan on processedorder_item poi  (cost=0.00..117223.50 rows=880850 width=23) (actual time=0.302..1426.753 rows=702469 loops=3)
              Filter: ((NOT "ContainsComposites") AND ("SKUId" <> 0))
              Rows Removed by Filter: 108940
              Buffers: shared read=84260
        ->  Hash  (cost=105532.45..105532.45 rows=173408 width=4) (actual time=42396.156..42396.156 rows=173668 loops=3)
              Buckets: 262144 (originally 262144)  Batches: 1 (originally 1)  Memory Usage: 8154kB
              Buffers: shared hit=444920 read=76764
              ->  Index Scan using ix_processedorder_receieveddate on processedorder o  (cost=0.43..105532.45 rows=173408 width=4) (actual time=0.827..42152.428 rows=173668 loops=3)
                    Index Cond: (("ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND ("ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
                    Buffers: shared hit=444920 read=76764

Finally, with common sense apparently not working, we just remove the index on the staging server and find it returns the data in 4 seconds (just like our local machine with the index) https://explain.tensor.ru/archive/explain/486512e19b45d5cbe4b893fdecc434b8:0:2021-12-08

Gather  (cost=1000.00..200395.65 rows=177078 width=4) (actual time=2.556..4695.986 rows=173668 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared read=41868
  ->  Parallel Seq Scan on processedorder o  (cost=0.00..181687.85 rows=73782 width=4) (actual time=0.796..4663.511 rows=57889 loops=3)
        Filter: (("ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND ("ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
        Rows Removed by Filter: 642941
        Buffers: shared read=151028

Before each query run I execute from SSH: echo 3 > /proc/sys/vm/drop_caches; service postgresql restart;

We also ran a vacuum full; analyze; before testing.

Is anyone able to explain why this is happening as it makes no sense to us- I would expect the query to perform fastest with the index, given that we are querying a small portion of the data (order records span 9 years and we are selecting just 3 months).

The server itself is Postgres 10.4 running on Amazon AWS E2 i3.2xlarge instance with a couple io2 EBS block store drives running in RAID 0 hosting the psql data.

work_mem is 150MB
shared_buffers is set to 15Gb (60gb server total ram)
effective_io_concurrency = 256 
effective_cache_size=45GB

----- Update 1 As per franks suggestion we tried adding a new index which didn't seem to help

Gather  (cost=1000.86..218445.54 rows=201063 width=19) (actual time=4.412..61616.676 rows=198552 loops=1)
  Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=239331 read=45293
  ->  Nested Loop  (cost=0.86..197339.24 rows=83776 width=19) (actual time=3.214..61548.378 rows=66184 loops=3)
        Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
        Buffers: shared hit=748815 read=136548
        Worker 0: actual time=2.494..61658.415 rows=65876 loops=1
          Buffers: shared hit=247252 read=45606
        Worker 1: actual time=3.033..61667.490 rows=69100 loops=1
          Buffers: shared hit=262232 read=45649
        ->  Parallel Index Only Scan using ix_processedorder_btree_receieveddate_orderid on public.processedorder o  (cost=0.43..112293.34 rows=72359 width=4) (actual time=1.811..40429.474 rows=57889 loops=3)
              Output: o."ReceivedDate", o."OrderId"
              Index Cond: ((o."ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND (o."ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
              Buffers: shared hit=97131 read=76571
              Worker 0: actual time=1.195..40625.809 rows=57420 loops=1
                Buffers: shared hit=31847 read=25583
              Worker 1: actual time=1.850..40463.813 rows=60469 loops=1
                Buffers: shared hit=34898 read=25584
        ->  Index Scan using ix_processedorder_item_orderid on public.processedorder_item poi  (cost=0.43..1.12 rows=2 width=23) (actual time=0.316..0.361 rows=1 loops=173668)
              Output: poi."OrderItemId", poi."OrderId", poi."SKUId", poi."Quantity", poi."ConvertedLineTotal", poi."ConvertedLineSubtotal", poi."ConvertedLineTax", poi."ConvertedLineDiscount", poi."ConversionRate", poi."OriginalLineTotal", poi."OriginalLineSubtotal", poi."OriginalLineTax", poi."OriginalLineDiscount", poi."Currency", poi."ContainsComposites", poi."TotalDispachCost", poi."TagId", poi."timestamp"
              Index Cond: (poi."OrderId" = o."OrderId")
              Buffers: shared hit=651684 read=59977
              Worker 0: actual time=0.316..0.362 rows=1 loops=57420
                Buffers: shared hit=215405 read=20023
              Worker 1: actual time=0.303..0.347 rows=1 loops=60469
                Buffers: shared hit=227334 read=20065

---- update 2

We've rerun the larger of the two queries

explain (analyze, buffers, verbose, costs, format json)
        select
            poi."SKUId" as "SKUId",
            poi."ConvertedLineTotal" as "totalrevenue",
            poi."TotalDispachCost" as "totaldispachcost",
            poi."Quantity" as "quantitysold"
        from public.processedorder o
        join public.processedorder_item poi on poi."OrderId" = o."OrderId"
        WHERE o."ReceivedDate" >= '2020-09-01' and o."ReceivedDate" <= '2021-01-01';

this time with track io timings enabled- we ran this twice, once with an index, and again without the index- these are the plans;

With Index: https://explain.tensor.ru/archive/explain/d763d7e1754c4ddac8bb61e403b135d2:0:2021-12-09

Gather  (cost=111153.17..252254.44 rows=201029 width=19) (actual time=36978.100..39083.705 rows=198552 loops=1)
  Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=138354 read=60159
  I/O Timings: read=16453.227
  ->  Hash Join  (cost=110153.17..231151.54 rows=83762 width=19) (actual time=36974.257..39044.116 rows=66184 loops=3)
        Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
        Hash Cond: (poi."OrderId" = o."OrderId")
        Buffers: shared hit=444230 read=160640
        I/O Timings: read=37254.816
        Worker 0: actual time=36972.516..39140.086 rows=79787 loops=1
          Buffers: shared hit=155175 read=48507
          I/O Timings: read=9382.648
        Worker 1: actual time=36972.438..39138.754 rows=77496 loops=1
          Buffers: shared hit=150701 read=51974
          I/O Timings: read=11418.941
        ->  Parallel Seq Scan on public.processedorder_item poi  (cost=0.00..114682.68 rows=1014089 width=23) (actual time=0.262..1212.102 rows=811409 loops=3)
              Output: poi."OrderItemId", poi."OrderId", poi."SKUId", poi."Quantity", poi."ConvertedLineTotal", poi."ConvertedLineSubtotal", poi."ConvertedLineTax", poi."ConvertedLineDiscount", poi."ConversionRate", poi."OriginalLineTotal", poi."OriginalLineSubtotal", poi."OriginalLineTax", poi."OriginalLineDiscount", poi."Currency", poi."ContainsComposites", poi."TotalDispachCost", poi."TagId", poi."timestamp"
              Buffers: shared read=84260
              I/O Timings: read=1574.316
              Worker 0: actual time=0.073..1258.453 rows=870378 loops=1
                Buffers: shared read=30133
                I/O Timings: read=535.914
              Worker 1: actual time=0.021..1256.769 rows=841299 loops=1
                Buffers: shared read=29126
                I/O Timings: read=538.814
        ->  Hash  (cost=104509.15..104509.15 rows=173662 width=4) (actual time=36972.511..36972.511 rows=173668 loops=3)
              Output: o."OrderId"
              Buckets: 262144 (originally 262144)  Batches: 1 (originally 1)  Memory Usage: 8154kB
              Buffers: shared hit=444149 read=76380
              I/O Timings: read=35680.500
              Worker 0: actual time=36970.996..36970.996 rows=173668 loops=1
                Buffers: shared hit=155136 read=18374
                I/O Timings: read=8846.735
              Worker 1: actual time=36970.783..36970.783 rows=173668 loops=1
                Buffers: shared hit=150662 read=22848
                I/O Timings: read=10880.127
              ->  Index Scan using ix_processedorder_btree_receieveddate on public.processedorder o  (cost=0.43..104509.15 rows=173662 width=4) (actual time=0.617..36736.881 rows=173668 loops=3)
                    Output: o."OrderId"
                    Index Cond: ((o."ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND (o."ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
                    Buffers: shared hit=444149 read=76380
                    I/O Timings: read=35680.500
                    Worker 0: actual time=0.018..36741.158 rows=173668 loops=1
                      Buffers: shared hit=155136 read=18374
                      I/O Timings: read=8846.735
                    Worker 1: actual time=0.035..36733.684 rows=173668 loops=1
                      Buffers: shared hit=150662 read=22848
                      I/O Timings: read=10880.127

And then, the faster run, without indexes; https://explain.tensor.ru/archive/explain/d0815f4b0c9baf3bdd512bc94051e768:0:2021-12-09

Gather  (cost=231259.20..372360.47 rows=201029 width=19) (actual time=4829.302..7920.614 rows=198552 loops=1)
  Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=106720 read=69983
  I/O Timings: read=2431.673
  ->  Hash Join  (cost=230259.20..351257.57 rows=83762 width=19) (actual time=4825.285..7877.981 rows=66184 loops=3)
        Output: poi."SKUId", poi."ConvertedLineTotal", poi."TotalDispachCost", poi."Quantity"
        Hash Cond: (poi."OrderId" = o."OrderId")
        Buffers: shared hit=302179 read=235288
        I/O Timings: read=7545.729
        Worker 0: actual time=4823.181..7978.501 rows=81394 loops=1
          Buffers: shared hit=87613 read=93424
          I/O Timings: read=2556.079
        Worker 1: actual time=4823.645..7979.241 rows=76022 loops=1
          Buffers: shared hit=107846 read=71881
          I/O Timings: read=2557.977
        ->  Parallel Seq Scan on public.processedorder_item poi  (cost=0.00..114682.68 rows=1014089 width=23) (actual time=0.267..2122.529 rows=811409 loops=3)
              Output: poi."OrderItemId", poi."OrderId", poi."SKUId", poi."Quantity", poi."ConvertedLineTotal", poi."ConvertedLineSubtotal", poi."ConvertedLineTax", poi."ConvertedLineDiscount", poi."ConversionRate", poi."OriginalLineTotal", poi."OriginalLineSubtotal", poi."OriginalLineTax", poi."OriginalLineDiscount", poi."Currency", poi."ContainsComposites", poi."TotalDispachCost", poi."TagId", poi."timestamp"
              Buffers: shared read=84260
              I/O Timings: read=4135.860
              Worker 0: actual time=0.034..2171.677 rows=865244 loops=1
                Buffers: shared read=29949
                I/O Timings: read=1394.407
              Worker 1: actual time=0.068..2174.408 rows=827170 loops=1
                Buffers: shared read=28639
                I/O Timings: read=1395.723
        ->  Hash  (cost=224615.18..224615.18 rows=173662 width=4) (actual time=4823.318..4823.318 rows=173668 loops=3)
              Output: o."OrderId"
              Buckets: 262144 (originally 262144)  Batches: 1 (originally 1)  Memory Usage: 8154kB
              Buffers: shared hit=302056 read=151028
              I/O Timings: read=3409.869
              Worker 0: actual time=4820.884..4820.884 rows=173668 loops=1
                Buffers: shared hit=87553 read=63475
                I/O Timings: read=1161.672
              Worker 1: actual time=4822.104..4822.104 rows=173668 loops=1
                Buffers: shared hit=107786 read=43242
                I/O Timings: read=1162.254
              ->  Seq Scan on public.processedorder o  (cost=0.00..224615.18 rows=173662 width=4) (actual time=0.744..4644.291 rows=173668 loops=3)
                    Output: o."OrderId"
                    Filter: ((o."ReceivedDate" >= '2020-09-01 00:00:00'::timestamp without time zone) AND (o."ReceivedDate" <= '2021-01-01 00:00:00'::timestamp without time zone))
                    Rows Removed by Filter: 1928823
                    Buffers: shared hit=302056 read=151028
                    I/O Timings: read=3409.869
                    Worker 0: actual time=0.040..4651.203 rows=173668 loops=1
                      Buffers: shared hit=87553 read=63475
                      I/O Timings: read=1161.672
                    Worker 1: actual time=0.035..4637.415 rows=173668 loops=1
                      Buffers: shared hit=107786 read=43242
                      I/O Timings: read=1162.254

Solution

From your last two execution plans, I gather the following:

the slow plan using the index reads 22848 blocks from disk for processedorder, which takes 10.88 seconds
the fast plan using the sequential scan reads 43242 blocks from disk for processedorder, which takes 1.162 seconds

So there can be two explanations:

in the fast case, these blocks are actually cached in the kernel cache, so it is a caching effect

Experiment by clearing the kernel cache.
random I/O is more expensive than the optimizer thinks it is

In that case, you could consider raising random_page_cost.