Tags: sql, postgresql, aggregate, group-by, postgresql-performance

Efficiently querying a huge time series table for one row every 15 minutes


I have two tables, conttagtable (t) and contfloattable (cf). T has about 43k rows. CF has over 9 billion.

I created an index on the tagindex column of both tables. This column can be thought of as a unique identifier for conttagtable and as a foreign key into conttagtable for contfloattable. I didn't explicitly create a PK or foreign key relating one table to the other, although the data is logically related by the tagindex column on both, as if conttagtable.tagindex were a PRIMARY KEY and contfloattable.tagindex were a FOREIGN KEY (tagindex) REFERENCES conttagtable(tagindex). The data came from a Microsoft Access dump, and I didn't know if I could trust tagindex to be unique, so "uniqueness" is not enforced.
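
For reference, this is what the relationship just described would look like if it were enforced; a sketch only, since these constraints are deliberately not applied here:

alter table conttagtable add primary key (tagindex);
alter table contfloattable add foreign key (tagindex)
    references conttagtable (tagindex);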

The data itself is extremely large.

I need to obtain a single, arbitrarily selected row from contfloattable for each 15-minute contfloattable.dateandtime interval, for each conttagtable.tagindex. So, if contfloattable has 4000 samples for a given tagindex spanning 30 minutes, I need one sample from the 0-14 minute range and one from the 15-30 minute range. Any one sample within the 15-minute range is acceptable; 1st, last, random, whatever.

In a nutshell, I need to get one sample every 15 minutes, but only one sample per t.tagname. The samples are currently recorded every 5 seconds and the data spans two years. This is a big-data problem and way over my head in terms of SQL. All of the time-interval solutions I have tried from googling or searching SO have yielded query times so long that they are not practical.
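
For reference, the bucketing arithmetic used in the attempts below: flooring epoch seconds divided by 900 (= 15 * 60) yields one integer per quarter-hour, so two timestamps land in the same bucket exactly when that value matches. A minimal self-contained example:

-- same arithmetic as floor(extract(epoch from dateandtime) / 60 / 15) below
select floor(extract(epoch from timestamp '2012-11-16 00:14:59') / 900) as bucket_a,
       floor(extract(epoch from timestamp '2012-11-16 00:15:01') / 900) as bucket_b;
-- bucket_a <> bucket_b: the two timestamps straddle the 00:15:00 boundary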

  • Are my indexes sufficient for a fast join? (they appear to be when leaving out the time interval part; see the EXPLAIN check sketched after this list)
  • Would I benefit from the addition of any other indexes?
  • What's the best/fastest query that accomplishes the above goals?
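
For reference, a quick hypothetical way to sanity-check index use: EXPLAIN prints the chosen plan without executing the query, and an Index Scan or Bitmap Index Scan on tagindex_contfloat (rather than a Seq Scan) means the index is being used. The tagindex value 42 is just a placeholder:

explain
select cf.dateandtime, cf."Val", t.tagname
from contfloattable cf
join conttagtable t on t.tagindex = cf.tagindex
where cf.tagindex = 42;  -- placeholder tagindex; look for Index Scan vs Seq Scan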

Here's an SQLFiddle containing the schema and some sample data: http://sqlfiddle.com/#!1/c7d2f/2

Schema:

        Table "public.conttagtable" (t)
   Column    |  Type   | Modifiers
-------------+---------+-----------
 tagname     | text    |
 tagindex    | integer |
 tagtype     | integer |
 tagdatatype | integer |
Indexes:
    "tagindex" btree (tagindex)


             Table "public.contfloattable" (CF)
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 dateandtime | timestamp without time zone |
 millitm     | integer                     |
 tagindex    | integer                     |
 Val         | double precision            |
 status      | text                        |
 marker      | text                        |
Indexes:
    "tagindex_contfloat" btree (tagindex)

The output I'd like to see is something like this:

cf.dateandtime      |cf."Val"|cf.status|t.tagname
--------------------------------------------------
2012-11-16 00:00:02  45       S         SuperAlpha
2012-11-16 00:00:02  45       S         SuperBeta
2012-11-16 00:00:02  45       S         SuperGamma
2012-11-16 00:00:02  45       S         SuperDelta
2012-11-16 00:15:02  45       S         SuperAlpha
2012-11-16 00:15:02  45       S         SuperBeta
2012-11-16 00:15:02  45       S         SuperGamma
2012-11-16 00:15:02  45       S         SuperDelta
2012-11-16 00:30:02  45       S         SuperAlpha
2012-11-16 00:30:02  45       S         SuperBeta
2012-11-16 00:30:02  45       S         SuperGamma
2012-11-16 00:30:02  45       S         SuperDelta
2012-11-16 00:45:02  42       S         SuperAlpha

...etc etc...

As suggested by Clodoaldo, this is my latest attempt. Any suggestions to speed it up?

with i as (
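    -- i: for each tag, keep the earliest sample timestamp within each 15-minute bucket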
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
    contfloattable cf
    inner join
    conttagtable t on cf.tagindex = t.tagindex
    inner join
    i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by floor(extract(epoch from cf.dateandtime) / 60 / 15), cf.tagindex

Query plan from the above: http://explain.depesz.com/s/loR


Solution

  • For 15-minute intervals:

    with i as (
        select cf.tagindex, min(dateandtime) dateandtime
        from contfloattable cf
        group by
            floor(extract(epoch from dateandtime) / 60 / 15),
            cf.tagindex
    )
    select cf.dateandtime, cf."Val", cf.status, t.tagname
    from
        contfloattable cf
        inner join
        conttagtable t on cf.tagindex = t.tagindex
        inner join
        i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
    order by cf.dateandtime, t.tagname
    

    Show the explain output for this query (if it works) so we can try to optimize. You can post it in this answer.

    Explain Output

    "Sort  (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
    "  Sort Key: cf.dateandtime, t.tagname"
    "  CTE i"
    "    ->  HashAggregate  (cost=49093252.56..49481978.32 rows=19436288 width=12)"
    "          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
    "  ->  Hash Join  (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
    "        Hash Cond: (cf.tagindex = t.tagindex)"
    "        ->  Merge Join  (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
    "              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
    "              ->  Sort  (cost=2741707.02..2790297.74 rows=19436288 width=12)"
    "                    Sort Key: i.tagindex, i.dateandtime"
    "                    ->  CTE Scan on i  (cost=0.00..388725.76 rows=19436288 width=12)"
    "              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
    "                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
    "                          Sort Key: cf.tagindex, cf.dateandtime"
    "                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
    "        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
    "              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"
    

    It looks like you need this index:

    create index cf_tag_datetime on contfloattable (tagindex, dateandtime)
    

    Run ANALYZE after creating it. Also note that any index on a big table adds significant overhead to data changes (inserts etc.), since it has to be updated on every change.
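
    Concretely, using the table name from the schema above:

    analyze contfloattable;  -- refresh the planner's statistics for contfloattable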

    Update

    I added the cf_tag_datetime index (tagindex,dateandtime) and here's the new explain:

    "Sort  (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
    "  Sort Key: cf.dateandtime, t.tagname"
    "  CTE i"
    "    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
    "          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
    "  ->  Hash Join  (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
    "        Hash Cond: (cf.tagindex = t.tagindex)"
    "        ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
    "              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
    "              ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
    "                    Sort Key: i.tagindex, i.dateandtime"
    "                    ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
    "              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
    "                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
    "                          Sort Key: cf.tagindex, cf.dateandtime"
    "                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
    "        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
    "              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"
    

    The estimated cost seems to have gone up :( However, if I remove the order by clause (not exactly what I need, but it would work), there is a big reduction:

    "Hash Join  (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
    "  Hash Cond: (cf.tagindex = t.tagindex)"
    "  CTE i"
    "    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
    "          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
    "  ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
    "        Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
    "        ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
    "              Sort Key: i.tagindex, i.dateandtime"
    "              ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
    "        ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
    "              ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
    "                    Sort Key: cf.tagindex, cf.dateandtime"
    "                    ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
    "  ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
    "        ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"
    

    I have not yet tried this index... will do so though. Stand by.

    Now, looking at it again, I think the inverse index could be even better, as it can be used not only in the Merge Join but also in the final Sort:

    create index cf_tag_datetime on contfloattable (dateandtime, tagindex)
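
    Note that this reuses the index name cf_tag_datetime, and index names must be unique within a schema, so the earlier (tagindex, dateandtime) index has to be dropped first, or the new index given a different name. A minimal sketch, assuming the first index is replaced outright:

    drop index cf_tag_datetime;  -- the earlier (tagindex, dateandtime) version
    create index cf_tag_datetime on contfloattable (dateandtime, tagindex);
    analyze contfloattable;      -- refresh statistics after rebuilding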