Search code examples
sqlpostgresqlelasticsearchwindow-functions

Rank over partition from postgresql in elasticsearch


We are facing a problem with migration a large data set into elasticsearch from postgres (backup or whatever).

We have schema similar like this

+---------------+--------------+------------+-----------+
|    user_id    |  created_at  |  latitude  | longitude |
+---------------+--------------+------------+-----------+
|       5       |  23.1.2015   |    12.49   |    20.39  |
+---------------+--------------+------------+-----------+
|       2       |  23.1.2015   |    12.42   |    20.32  |
+---------------+--------------+------------+-----------+
|       2       |  24.1.2015   |    12.41   |    20.31  |
+---------------+--------------+------------+-----------+
|       5       |  25.1.2015   |    12.45   |    20.32  |
+---------------+--------------+------------+-----------+
|       1       |  23.1.2015   |    12.43   |    20.34  |
+---------------+--------------+------------+-----------+
|       1       |  24.1.2015   |    12.42   |    20.31  |
+---------------+--------------+------------+-----------+

And we are able to find a latest position by created_at thanks to rank function in SQL

... WITH locations AS ( 
select user_id, lat, lon, rank() over (partition by user_id order by created_at) as r
FROM locations)
SELECT user_id, lat, lon FROM locations WHERE r = 1

and the result is only newest created locations for each user:

+---------------+--------------+------------+-----------+
|    user_id    |  created_at  |  latitude  | longitude |
+---------------+--------------+------------+-----------+
|       2       |  24.1.2015   |    12.41   |    20.31  |
+---------------+--------------+------------+-----------+
|       5       |  25.1.2015   |    12.45   |    20.32  |
+---------------+--------------+------------+-----------+
|       1       |  24.1.2015   |    12.42   |    20.31  |
+---------------+--------------+------------+-----------+

After we import the data into elasticsearch, our document model looks like:

{
   "location" : { "lat" : 12.45, "lon" : 46.84 },
   "user_id"  : 5,
   "created_at" : "2015-01-24T07:55:20.606+00:00"
}
etc...

I am looking for alternatives for this SQL query in elasticsearch query, I think it must be possible, but i did not find how yet.


Solution

  • You can achieve this using field collapsing clubbed with inner_hits.

    {
        "collapse": {
            "field": "user_id",
            "inner_hits": {
                "name": "order by created_at",
                "size": 1,
                "sort": [
                    {
                        "created_at": "desc"
                    }
                ]
            }
        },
    }
    

    Detailed Article: https://blog.francium.tech/sql-window-function-partition-by-in-elasticsearch-c2e3941495b6