Search code examples
amazon-web-servicesamazon-athena

How to extract time series data before and after a specific event by using AWS athena?


I have the storage that stores huge time series data and I can extract data through AWS athena. However, I don't have any ideas how I can extract time series data before and after a specific event by using AWS athena.

What query can achieve it ?

Does anyone have ideas and samples of query on Athena ?

For example, I have following input data.

<input data>

id | timestamp              | value | level |
---------------------------------------------
1  | 2021-04-01T12:00:00+00:00 | 100.0 | 1     |
2  | 2021-04-01T12:00:10+00:00 | 98.0  | 1     |
3  | 2021-04-01T12:00:20+00:00 | 99.5  | 1     |
...
58 | 2021-04-01T12:09:40+00:00 | 98.2  | 1     |
59 | 2021-04-01T12:09:50+00:00 | 95.3  | 1     |
60 | 2021-04-01T12:10:00+00:00 | 99.2  | 1     |
61 | 2021-04-01T12:10:10+00:00 | 97.6  | 2     |
62 | 2021-04-01T12:10:20+00:00 | 98.6  | 2     |
63 | 2021-04-01T12:10:30+00:00 | 98.3  | 2     |
64 | 2021-04-01T12:10:40+00:00 | 98.1  | 2     |
...
100 | 2021-04-01T12:16:40+00:00 | 97.6  | 2     |

What I want to do is to extract the records for 30 sec before and after level 1->2 change event.

In this case, the expected outputs are data from id:58 to id:64.


Solution

  • You can use 'lag' function to determine timestamps where the level changes:

    SELECT *
    FROM (SELECT *
          FROM (SELECT timestamp,
                       lag(level) OVER (order by timestamp) AS prev_level,
                       level
                FROM dataset)
          WHERE prev_level != level)
    

    And then use those timestamps to filter out the dataset. For example something like this:

    WITH dataset(id,timestamp,value,level) AS (
        VALUES 
    
    ('1',timestamp '2021-04-01 12:00:00+00:00',100.0,1),
    ('2',timestamp '2021-04-01 12:00:10+00:00',98.0,1),
    ('3',timestamp '2021-04-01 12:00:20+00:00',99.5,1),
    ('58',timestamp '2021-04-01 12:09:40+00:00',98.2,1),
    ('59',timestamp '2021-04-01 12:09:50+00:00',95.3,1),
    ('60',timestamp '2021-04-01 12:10:00+00:00',99.2,1),
    ('61',timestamp '2021-04-01 12:10:10+00:00',97.6,2),
    ('62',timestamp '2021-04-01 12:10:20+00:00',98.6,2),
    ('63',timestamp '2021-04-01 12:10:30+00:00',98.3,2),
    ('64',timestamp '2021-04-01 12:10:40+00:00',98.1,2),
    ('100',timestamp '2021-04-01 12:16:40+00:00',97.6,2)
    )
    
    
    SELECT *
    FROM dataset o
    WHERE EXISTS(
                  SELECT *
                  FROM (SELECT *
                        FROM (SELECT timestamp,
                                     lag(level) OVER (order by timestamp) AS prev_level,
                                     level
                              FROM dataset)
                        WHERE prev_level != level)
                  WHERE (o.level = level AND o.timestamp - timestamp < interval '30' second)
                     OR (o.level = prev_level AND timestamp - o.timestamp < interval '30' second)
              )
    

    Output:

    id timestamp value level
    59 2021-04-01 12:09:50.000 UTC 95.3 1
    60 2021-04-01 12:10:00.000 UTC 99.2 1
    61 2021-04-01 12:10:10.000 UTC 97.6 2
    62 2021-04-01 12:10:20.000 UTC 98.6 2
    63 2021-04-01 12:10:30.000 UTC 98.3 2