I have two tables, conttagtable (t) and contfloattable (cf). t has about 43k rows; cf has over 9 billion.

I created an index on the tagindex column of both tables. This column can be thought of as a unique identifier for conttagtable and as a foreign key into conttagtable for contfloattable. I didn't explicitly create a PK or foreign key on either table relating to the other, although the data is logically related by the tagindex column on both tables, as if conttagtable.tagindex were a PRIMARY KEY and contfloattable.tagindex were a FOREIGN KEY (tagindex) REFERENCES conttagtable(tagindex). The data came from a Microsoft Access dump and I didn't know if I could trust tagindex to be unique, so "uniqueness" is not enforced.
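For clarity, the constraints that would have expressed this relationship, had I trusted the data, would look something like this (hypothetical; never actually applied):

-- hypothetical constraints, NOT created on the real tables
alter table conttagtable add primary key (tagindex);
alter table contfloattable
    add foreign key (tagindex) references conttagtable (tagindex);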
The data itself is extremely large. I need to obtain a single, arbitrarily selected row from contfloattable for each 15-minute contfloattable.dateandtime interval, for each conttagtable.tagindex. So, if contfloattable for a given tagindex has 4000 samples spanning 30 minutes, I need one sample from the 0-14 minute range and one sample from the 15-30 minute range. Any one sample within a 15-minute range is acceptable: first, last, random, whatever.

In a nutshell, I need one sample every 15 minutes, but only one sample per t.tagname. Samples are currently recorded every 5 seconds and the data spans two years. This is a big-data problem and way over my head in terms of SQL. All of the time-interval solutions I have tried from googling or searching on SO have yielded query times so long that they are not practical.
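The attempts below compute the 15-minute buckets by dividing the Unix epoch into 900-second slots. A minimal illustration of that expression, using two hypothetical timestamps on either side of a boundary:

-- 900 seconds = 15 minutes; timestamps in the same slot share a bucket number
select floor(extract(epoch from timestamp '2012-11-16 00:14:59') / 900) as bucket_a,
       floor(extract(epoch from timestamp '2012-11-16 00:15:01') / 900) as bucket_b;
-- bucket_a <> bucket_b, so these two samples fall into different 15-minute intervals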
Here's an SQLFiddle containing the schema and some sample data: http://sqlfiddle.com/#!1/c7d2f/2
Schema:
Table "public.conttagtable" (t)
Column | Type | Modifiers
-------------+---------+-----------
tagname | text |
tagindex | integer |
tagtype | integer |
tagdatatype | integer |
Indexes:
"tagindex" btree (tagindex)
Table "public.contfloattable" (CF)
Column | Type | Modifiers
-------------+-----------------------------+-----------
dateandtime | timestamp without time zone |
millitm | integer |
tagindex | integer |
Val | double precision |
status | text |
marker | text |
Indexes:
"tagindex_contfloat" btree (tagindex)
The output I'd like to see is something like this:
cf.dateandtime |cf."Val"|cf.status|t.tagname
--------------------------------------------------
2012-11-16 00:00:02 45 S SuperAlpha
2012-11-16 00:00:02 45 S SuperBeta
2012-11-16 00:00:02 45 S SuperGamma
2012-11-16 00:00:02 45 S SuperDelta
2012-11-16 00:15:02 45 S SuperAlpha
2012-11-16 00:15:02 45 S SuperBeta
2012-11-16 00:15:02 45 S SuperGamma
2012-11-16 00:15:02 45 S SuperDelta
2012-11-16 00:30:02 45 S SuperAlpha
2012-11-16 00:30:02 45 S SuperBeta
2012-11-16 00:30:02 45 S SuperGamma
2012-11-16 00:30:02 45 S SuperDelta
2012-11-16 00:45:02 42 S SuperAlpha
...etc etc...
As suggested by Clodoaldo, this is my latest attempt. Any suggestions to speed it up?
with i as (
    -- earliest sample in each (15-minute bucket, tag) pair
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from contfloattable cf
inner join conttagtable t on cf.tagindex = t.tagindex
-- join back to fetch the full row for each bucket's earliest timestamp
inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by floor(extract(epoch from cf.dateandtime) / 60 / 15), cf.tagindex
Query plan from the above: http://explain.depesz.com/s/loR
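Since any row per bucket is acceptable, a DISTINCT ON variant might avoid the CTE and the join back against contfloattable entirely. A sketch (PostgreSQL's DISTINCT ON keeps the first row of each distinct combination, as ordered by the matching leading ORDER BY expressions; untested at this scale):

-- one arbitrary (here: earliest) row per (15-minute bucket, tag)
select distinct on (floor(extract(epoch from cf.dateandtime) / 900), cf.tagindex)
       cf.dateandtime, cf."Val", cf.status, t.tagname
from contfloattable cf
inner join conttagtable t on cf.tagindex = t.tagindex
order by floor(extract(epoch from cf.dateandtime) / 900), cf.tagindex, cf.dateandtime;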
For 15-minute intervals:
with i as (
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from contfloattable cf
inner join conttagtable t on cf.tagindex = t.tagindex
inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname
Show the explain output for this query (if it works) so we can try to optimize. You can post it in this answer.
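To get the plan cheaply, prefix the query with explain, which only shows the planner's estimates; explain analyze would actually execute it and report real timings, likely impractical at this size. A trivial illustration:

explain
select cf.dateandtime, cf."Val"
from contfloattable cf
limit 10;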
Explain Output
"Sort (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49481978.32 rows=19436288 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2741707.02..2790297.74 rows=19436288 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..388725.76 rows=19436288 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
It looks like you need this index:
create index cf_tag_datetime on contfloattable (tagindex, dateandtime)
Run analyze after creating it. Also note that any index on a big table has a significant performance impact on data changes (inserts etc.), as it has to be updated on each change.
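Concretely, a table-level run is enough (and far cheaper than a database-wide analyze):

analyze contfloattable;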
Update
I added the cf_tag_datetime index (tagindex, dateandtime) and here's the new explain:
"Sort (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
It seems the estimated cost has gone up :( However, if I remove the order by clause (not exactly what I need, but it would work), there is a big reduction:
"Hash Join (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
I have not yet tried this index... will do so, though. Standby.
Now, looking at it again, I think the inverse index could be even better, as it can be used not only in the Merge Join but also in the final Sort:

create index cf_tag_datetime on contfloattable (dateandtime, tagindex)
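Note that this reuses the name cf_tag_datetime from the earlier index, which PostgreSQL will reject while that index still exists. A minimal sketch, assuming the earlier index is meant to be replaced rather than kept alongside (cf_datetime_tag is just a hypothetical new name):

-- replace the (tagindex, dateandtime) index with its inverse
drop index if exists cf_tag_datetime;
create index cf_datetime_tag on contfloattable (dateandtime, tagindex);
analyze contfloattable; -- refresh planner statistics after the change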