I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data
in a table called accounts
:
data: {
"lastUpdatedTime": "2016-12-26T12:09:43.901Z",
"UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
"data": {
"country": "UK",
"verified_at": "2017-01-01T23:49:10.217Z"
}
}
The data format cannot be changed since this is legacy information.
I need to obtain all accounts where the country is UK
, the verified_at
value is not null and the lastUpdatedTime
value is greater than some given value.
So far, I have the following query:
SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;
And the following indexes:
"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))
I've managed to get the query time down to about 1000 to 4000ms
Here is the analyze from the query:
Bitmap Heap Scan on accounts (cost=41.31..6934.50 rows=9 width=1719)
(actual time=7.273..1067.657 rows=23190 loops=1)
Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
Rows Removed by Filter: 4
Heap Blocks: exact=16039
-> Bitmap Index Scan on accounts_idxgin_on_data (cost=0.00..41.30 rows=1773 width=0)
(actual time=4.618..4.618 rows=23194 loops=1)
Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Planning time: 0.448 ms
Execution time: 1069.344 ms
(9 rows)
I have the following questions
field is not null
query with JSONB? I ended up using the existence operator with (data->'data' ? 'verified_at')
to filter out a large number of non-matching records, because much of my data doesn't have verified_at
as a top level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.(data->'data' ? 'verified_at')
, I needed to add another index on ((data -> 'data'::text))
. I already had an index on gin (data)
, but the existence operator didn't use this. Why is that? I thought the existence and containment operators would use this index.After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:
CREATE INDEX index_accounts_partial_on_verified_at
ON accounts ((data->'data'->'verified_at'))
WHERE (data->'data'->>'verified_at') IS NOT NULL
AND (data->'data' ? 'verified_at')
AND (data->'data'->>'country' = 'UK');
I was able to hardcode some of the values in this index, such as country=UK
because I only need to consider UK
accounts for this query. I was also able to remove the index on ((data->'data'))
which was 258MB, and replace it with the partial index which is only 1360 kB!
For anyone interested, I found the details for building a partial JSONB index from here