I have collected data from an API to build a history. Initially, I saved all values every five minutes. Later, I changed my program to only save data that has changed.
Now, I want to clean up my old data and remove all values where the count
has not changed from the previous record within the same account
and id
.
account id count time
42 12147 492 2015-09-20 11:31:14.0
42 12147 492 2015-09-20 11:36:19.0 // delete
13 12147 246 2015-09-20 11:31:14.0
2 12253 183 2015-09-20 11:36:19.0
2 19684 805 2015-09-20 12:00:41.0 // note in next comment
2 19684 810 2015-09-20 12:05:41.0
2 19684 805 2015-09-20 12:10:41.0 // we had this combination, but do not delete this record because the previous value was different
2 19684 805 2015-09-20 12:15:41.0 // delete
2 19684 805 2015-09-20 12:20:41.0 // delete
2 19684 806 2015-09-20 12:25:41.0
I tried to solve this with a group by
over account
, id
and count
. With that approach, however, it will delete non-consecutive duplicates—i.e., if a record has the same value again after some time, it will fall into the same group.
I also thought about writing a small script where I iterate over all data and delete the current row if account
, id
, and count
are the same as the previous record, but I'm curious if this be possible with an single SQL statement?
You can use the following query:
DELETE history
FROM history
INNER JOIN (SELECT MIN(time) AS minTime, account, id, count
FROM history
GROUP BY account, id, count) AS h
ON history.account = h.account AND history.id = h.id AND history.count = h.count
WHERE history.time > h.minTime
EDIT:
After the edit made I think there are still some errors in sample data of the OP (time
field should be in acending order).
Using an additional assumption of a PK present in the table, you can use the following query:
SELECT pk
FROM history AS h1
WHERE account = (SELECT account
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)
AND
id = (SELECT id
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)
AND
count = (SELECT count
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)
in order to identify to-de-deleted records (see this demo).
Now you can easily delete unwanted rows using NOT IN
operator:
DELETE FROM history
WHERE pk IN (
SELECT x.pk
FROM (
SELECT pk
FROM history AS h1
WHERE
account = (SELECT account
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)
AND
id = (SELECT id
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)
AND
count = (SELECT count
FROM history AS h2
WHERE h1.account = h2.account AND
h1.id = h2.id AND
h2.time < h1.time
ORDER BY time DESC
LIMIT 1)) AS x)
EDIT 2:
Using variables in order to located to-de-deleted pk
values might lead to query that is considerably faster:
SELECT pk
FROM (
SELECT pk, account, id, count, time,
@rn := IF (account = @acc AND id = @id AND count = @count,
@rn + 1, 1) AS rn,
@acc := account,
@id := id,
@count := count
FROM history
CROSS JOIN (SELECT @rn = 0, @acc = 0, @id = 0, @count = 0) AS vars
ORDER BY account, id, time, count ) AS t
WHERE t.rn > 1