The following query finishes in 1.5s (which is ok, the table contains about 500M rows):
explain (analyze, buffers)
select sales.*
from sales
join product on (product.id = sales.productid)
join date on (date.id = sales.dateid)
where product.id = 24
order by date.timestamp
limit 200;
Query plan: http://explain.depesz.com/s/8Ix
Searching for product.name instead increases the runtime to totally unacceptable 200s:
explain (analyze, buffers)
select sales.*
from sales
join product on (product.id = sales.productid)
join date on (date.id = sales.dateid)
where product.name = 'new00000006'
order by date.timestamp
limit 200;
Query plan: http://explain.depesz.com/s/0RfQ
Note that the product named 'new00000006' has id 24 (same id as in the fast query above). Proof:
select name from product where id = 24;
name
-------------
new00000006
Why does that query take 200 times longer than the first query?
Another interesting modification of this query is this.. instead of product.id = 24 (like in the first query), I use product.id = (select 24). This also takes 200s to run (it actually results in the same bad query plan as when searching for product.name):
explain (analyze, buffers)
select sales.*
from sales
join product on (product.id = sales.productid)
join date on (date.id = sales.dateid)
where product.id = (select 24)
order by date.timestamp
limit 200;
Query plan: http://explain.depesz.com/s/K3VO
The statistics table shows that product id 24 is "rare":
select most_common_vals from pg_stats where tablename='sales' and attname='productid';
{19,2,7,39,40,14,33,18,8,37,16,48,6,23,49,29,46,41,20,53,47,26,38,1,32,42,56,57,10,15,27,50,30,45,51,58,17,36,4,25,44,43,5,22,11,35,52,9,21,12,24,31,28,54,34,3,55,13}
select most_common_freqs from pg_stats where tablename='sales' and attname='productid';
{0.020225,0.020119,0.0201133,0.0201087,0.0201,0.0200903,0.0200843,0.020069,0.0200557,0.0200477,0.0200427,0.0200303,0.0200197,0.020019,0.020012,0.0200107,0.0200067,0.020006,0.019995,0.0199947,0.0199917,0.019986,0.019986,0.0199777,0.0199747,0.0199713,0.0199693,0.019969,0.019967,0.019962,0.0199607,0.0199603,0.01996,0.0199567,0.0199567,0.0199533,0.019952,0.019951,0.0199467,0.019944,0.019944,0.01993,0.0199297,0.0199257,0.0199223,0.0199143,0.01989,0.0198887,0.019883,0.0198747,6.7e-005,6e-005,5.9e-005,5.6e-005,5.46667e-005,5.43333e-005,5.13333e-005,4.96667e-005}
Product id 24 has a frequency of 6.7e-005 (it's a "new product"), while older products have frequencies of about 0.01.
Statistics show that the first query plan (the one that runs in 1.5s) makes perfect sense. It uses the sales_productid_index to quickly find the sales of this product. Why isn't the same query plan used in the other two cases? It seems like statistics are ignored.
Table definitions (slightly obfuscated / renamed):
Tabelle äpublic.salesô
Spalte | Typ | Attribute | Speicherung | Statistikziel | Beschreibung
-----------+---------+-----------+-------------+---------------+--------------
id | uuid | not null | plain | |
dateid | integer | | plain | 10000 |
productid | integer | | plain | 10000 |
a | text | | extended | 10000 |
b | integer | | plain | 10000 |
x1 | boolean | | plain | |
x2 | boolean | | plain | |
x3 | boolean | | plain | |
x4 | boolean | | plain | |
x5 | boolean | | plain | |
Indexe:
"sales_pkey" PRIMARY KEY, btree (id)
"sales_a_index" btree (a)
"sales_b_index" btree (b)
"sales_dateid_index" btree (dateid)
"sales_productid_index" btree (productid)
Fremdschlnssel-Constraints:
"sales_dateid_fkey" FOREIGN KEY (dateid) REFERENCES date(id)
"sales_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id)
Hat OIDs: nein
Tabelle äpublic.productô
Spalte | Typ | Attribute | Speicherung | Statistikziel | Beschreibung
--------+---------+----------------------------------------------------------+-------------+---------------+--------------
id | integer | not null Vorgabewert nextval('product_id_seq'::regclass) | plain | |
name | text | | extended | |
Indexe:
"product_pkey" PRIMARY KEY, btree (id)
"product_name_index" UNIQUE, btree (name)
Fremdschlnsselverweise von:
TABLE "sales" CONSTRAINT "sales_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id)
TABLE "salesaggr" CONSTRAINT "salesaggr_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id)
Hat OIDs: nein
Version: PostgreSQL 9.3.1, compiled by Visual C++ build 1600, 64-bit
Config: default configuration except for maintenance_work_mem, which has been increased to 1GB.
Operating system: Microsoft Windows [Version 6.2.9200]
Amount and size of RAM installed: 32GB
Storage: a single 1TB SSD
In your first query, the planner takes a shortcut and uses the sales_productid_index available on sales.productid since it is told that sales.productid = product.id. The only thing the join with product actually does in this query is assuring that a row with id = 24 actually exists in the table.
In the second query, this shortcut isn't available. The planner could choose to go to product, get the id and then scan sales using the index on productid, probably getting similar performance, but because he doesn't know that name='new00000006' will lead to id=24, he can't guess how many rows in sales this path would result in*. For all he knows he'd be index scanning a significant part of the 300M rows sales table.
*Note that in the first query, he guesses that productid=24 will result in 42393 rows, while getting 34560 rows. Quite close considering the table has 300M rows.