Search code examples
databasepostgresqljsonb

Removing consecutive duplicates in a postgresql database where data is in json column


So I have a postgresql table called state_data where there are two columns: datetime and state. The state column is of jsonb type and specifies various state data for a given datetime. Here is an example of the table:

datetime            | state
================================================
2018-10-31 08:00:00 | {"temp":75.0,"location":1}
2018-10-31 08:01:00 | {"temp":75.0,"location":1}
2018-10-31 08:02:00 | {"temp":75.0,"location":1}
2018-10-31 08:03:00 | {"temp":75.0,"location":2}
2018-10-31 08:04:00 | {"temp":74.8,"location":1}
2018-10-31 08:05:00 | {"temp":74.8,"location":2}
2018-10-31 08:06:00 | {"temp":74.7,"location":1}

Over time this table will get very big - particularly I increase sampling frequency - and I really only want to store data where consecutive rows have different temperatures. So the table above would reduce to,

datetime            | state
================================================
2018-10-31 08:00:00 | {"temp":75.0,"location":1}
2018-10-31 08:04:00 | {"temp":74.8,"location":1}
2018-10-31 08:06:00 | {"temp":74.7,"location":1}

I know how to do this if the temperature data were in its own column, but is there a straightforward way to handle this operation and delete all consecutive duplicates based on an item within a json column?

What if I wanted to delete duplicates for both json items? For example,

datetime            | state
================================================
2018-10-31 08:00:00 | {"temp":75.0,"location":1}
2018-10-31 08:03:00 | {"temp":75.0,"location":2}
2018-10-31 08:04:00 | {"temp":74.8,"location":1}
2018-10-31 08:05:00 | {"temp":74.8,"location":2}
2018-10-31 08:06:00 | {"temp":74.7,"location":1}

Solution

  • Use the window function lag():

    select datetime, state
    from (
        select datetime, state, lag(state) over (order by datetime) as prev
        from state_data
        ) s
    where state->>'temp' is distinct from prev->>'temp'
    

    If the table has a primary key you should use it in the delete command. In the lack of a primary key you can cast state to jsonb:

    delete from state_data
    where (datetime, state::jsonb) not in (
        select datetime, state::jsonb
        from (
            select datetime, state, lag(state) over (order by datetime) as prev
            from state_data
            ) s
        where state->>'temp' is distinct from prev->>'temp'
    )