Tags: java, python, elasticsearch, geolocation

Get the most probable positions in a geolocation cloud


I have no clue where to even start with the problem I want to solve, so I would appreciate the keywords to search for, or any other pointers.

I have a cloud of geolocations (latitude, longitude) per user, and I would like to find the middle point (not sure that's the right term) of the different groups of locations.

For example, I record the location of a wild animal every minute, so I end up with something like this.

What I want is to process those points and get the most frequent group of points, plus the midpoint of each group (something like this),

and to know, for example, that the midpoint of group one is the most "repeated", so that's where the animal sleeps, the second one is where it drinks, and the third one is where it eats.

I have the points in a CSV file, so I can use Elasticsearch, Java or even Python to get this information.

Any clue about this would be really appreciated.


Solution

  • This is a typical clustering use case and I see essentially two options:

    1. Centroid-based clustering

    which could be approached in Elasticsearch through centroid aggregations.

    2. Density-based clustering

    DBC is a much better approach b/c it's outlier-aware. There are Python implementations out there, incl. scikit-learn's very own DBSCAN. Not overly familiar with them, so that's all I can say at this point; see the rough sketch right below.
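
    A minimal, untested sketch of that option with scikit-learn. The CSV name, the "lat"/"lon" column names and the eps/min_samples values are assumptions you'd still have to tune to your data:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN

    # hypothetical CSV with one GPS fix per row, columns "lat" and "lon"
    df = pd.read_csv("animal_positions.csv")
    coords = np.radians(df[["lat", "lon"]].to_numpy())  # haversine expects radians

    kms_per_radian = 6371.0
    db = DBSCAN(eps=0.5 / kms_per_radian,   # ~500 m neighbourhood
                min_samples=10,             # assumed; tune to your sampling rate
                metric="haversine",
                algorithm="ball_tree").fit(coords)
    df["cluster"] = db.labels_              # -1 means noise / outlier

    # centroid and size per cluster; the biggest one is the "most repeated" spot
    summary = (df[df["cluster"] != -1]
               .groupby("cluster")
               .agg(lat=("lat", "mean"), lon=("lon", "mean"), points=("lat", "size"))
               .sort_values("points", ascending=False))
    print(summary)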


    I'm here to talk about Elasticsearch so here's how you'd do option #1:

    1. Set up an index
    PUT animals
    {
      "mappings": {
        "properties": {
          "location": {
            "type": "geo_point"
          }
        }
      }
    }
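
    If you'd rather drive this from Python, a hedged sketch of the same request with the official elasticsearch-py client (8.x assumed; the cluster URL is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust URL / auth to your setup

    # same mapping as the PUT request above: a single geo_point field
    es.indices.create(
        index="animals",
        mappings={"properties": {"location": {"type": "geo_point"}}},
    )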
    
    2. Add some locations to it
    POST _bulk
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[7.5146484375,51.17934297928927]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[7.207031249999999,50.94458443495011]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[7.734374999999999,51.069016659603896]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[7.536621093749999,50.94458443495011]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[8.525390625,51.16556659836182]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[9.55810546875,50.83369767098071]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[9.0087890625,51.138001488062564]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[10.21728515625,50.56928286558243]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[10.87646484375,50.84757295365389]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.25,50.84757295365389]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.09619140625,50.77815527465925]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.513671874999998,50.84757295365389]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.3818359375,50.708634400828224]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.00830078125,50.736455137010665]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.6455078125,51.52241608253253]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[10.78857421875,50.3734961443035]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[10.546875,49.96535590991311]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[10.01953125,49.681846899401286]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[9.29443359375,49.85215166776998]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[8.942871093749998,49.710272582105695]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[9.20654296875,49.5822260446217]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[8.98681640625,49.52520834197442]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[8.6572265625,49.603590524348704]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.546630859375,50.14874640066278]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.865234375,50.0289165635219]}
    {"index":{"_index":"animals","_type":"_doc"}}
    {"location":[11.42578125,50.52041218671901]}
    

    I'm using some randomized points in Germany based on your sketch (screenshot: indexed points plotted on a map).
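
    Since your points live in a CSV, a rough sketch of bulk-indexing them from Python instead (elasticsearch-py 8.x plus the file and column names are assumptions):

    import pandas as pd
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")   # placeholder URL
    df = pd.read_csv("animal_positions.csv")      # hypothetical file name

    # geo_point arrays are [lon, lat], same order as in the bulk payload above
    actions = ({"_index": "animals",
                "_source": {"location": [row.lon, row.lat]}}
               for row in df.itertuples())
    bulk(es, actions)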

    3. Compute the centroids
    POST animals/_search
    {
      "size": 0,
      "aggs": {
        "weighted": {
          "geohash_grid": {
            "field": "location",
            "precision": 2
          },
          "aggs": {
            "centroid": {
              "geo_centroid": {
                "field": "location"
              }
            }
          }
        }
      }
    }
    

    This traverses all the points, not just the "clearly bound" groups in your sketch. This means there will be sparse buckets with very few points (outliers) that need to be skipped.

    So taking the buckets that Elasticsearch returns, keeping only the larger ones (I'm using JS instead of Python here), and converting them to GeoJSON with TurfJS:

    // keep only buckets with more than 3 points and turn each
    // bucket's centroid into a GeoJSON point feature
    turf.featureCollection(
      buckets.filter(p => p.doc_count > 3)
             .map(p => turf.point([
               p.centroid.location.lon,
               p.centroid.location.lat
             ])))
    

    yields the following:

    (screenshot: the filtered cluster centroids plotted on a map)

    As you can see, the "centers" are skewed b/c the point concentrations are not high enough. With more tightly concentrated groups, the results get better.
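
    For completeness, a hedged Python equivalent of that last step (elasticsearch-py 8.x assumed) that runs the aggregation and builds plain GeoJSON without TurfJS:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # placeholder URL

    resp = es.search(index="animals", size=0, aggs={
        "weighted": {
            "geohash_grid": {"field": "location", "precision": 2},
            "aggs": {"centroid": {"geo_centroid": {"field": "location"}}}
        }
    })

    # keep only the larger buckets and turn their centroids into GeoJSON points
    features = [
        {"type": "Feature",
         "properties": {"doc_count": b["doc_count"]},
         "geometry": {"type": "Point",
                      "coordinates": [b["centroid"]["location"]["lon"],
                                      b["centroid"]["location"]["lat"]]}}
        for b in resp["aggregations"]["weighted"]["buckets"]
        if b["doc_count"] > 3
    ]
    feature_collection = {"type": "FeatureCollection", "features": features}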

    But to be frank, DBSCAN is the way to go here, not weighted centroids.