Search code examples
influxdb

Removing duplicates in InfluxDB


I would like to perform a query to remove duplicates. What I define as a duplicate here is a measurement where we have more than 1 data point. They will have different tags, so they are not overwritten by default but I would like to remove the oldest inserted, regardless of the tags.

So for example, measurement of logins (it doesn't really make sense but it's to avoid using abstract entities):

> Email   | Name    | TS         | Login Time
> 
> [email protected] | Alice   | xxxxx1000  | 2017-05-19
> [email protected] | Alice   | xxxxx1000  | 2017-05-18
> [email protected] | Alice   | xxxxx1000  | 2017-05-17
> [email protected] | Bob     | xxxxx1000  | 2017-05-18
> [email protected] | Charlie | xxxxx1200  | 2017-05-19

I would like to remove the second and third line, because the data point has the same timestamp as the first, it is the same measurement but they have different login times and I would like to take only the last.

I know well that I could solve this with a query, but the requirement is more complex than this (visualization in Grafana of weird KPI data) and I need to remove actual duplicates (generated and loaded twice).

Thank you.


Solution

  • You can fetch all login user names using group by and then order by time , so that the latest login time will come up first ,then you can delete the remaining ones.

    Also, you might need to copy your latest items to some another measurement , since you can't remove row in influxdb .

    For this you might use limit 1 offset 0 so that only the latest login time will come from the query output.

    Let me know, if I understand it correctly.