Tags: amazon-web-services, elasticsearch, geolocation, location, geojson

Using Elasticsearch Geo Functionality To Find Most Common Locations?


I have a geojson file containing a list of locations, each with a longitude, latitude and timestamp. Note that the longitudes and latitudes are multiplied by 10,000,000 (hence the E7 suffix in the field names).

{
  "locations" : [ {
    "timestampMs" : "1461820561530",
    "latitudeE7" : -378107308,
    "longitudeE7" : 1449654070,
    "accuracy" : 35,
    "junk_i_want_to_save_but_ignore" : [ { .. } ]
  }, {
    "timestampMs" : "1461820455813",
    "latitudeE7" : -378107279,
    "longitudeE7" : 1449673809,
    "accuracy" : 33
  }, {
    "timestampMs" : "1461820281089",
    "latitudeE7" : -378105184,
    "longitudeE7" : 1449254023,
    "accuracy" : 35
  }, {
    "timestampMs" : "1461820155814",
    "latitudeE7" : -378177434,
    "longitudeE7" : 1429653949,
    "accuracy" : 34
  }
  ..

Many of these locations will be the same physical location (e.g. the user's home), but obviously the longitudes and latitudes may not be exactly the same.

I would like to use Elasticsearch and its geo functionality to produce a ranked list of the most common locations, where locations are deemed to be the same if they are within, say, 100m of each other.

For each common location, I'd also like the list of all timestamps at which the user was at that location, if possible!

I'd very much appreciate a sample query to get me started!

Many thanks in advance.


Solution

  • In order to make it work, you need to create your index with a mapping like this:

    PUT /locations
    {
      "mappings": {
        "location": {
          "properties": {
            "location": {
              "type": "geo_point"
            },
            "timestampMs": {
              "type": "date",
              "format": "epoch_millis"
            },
            "accuracy": {
              "type": "long"
            }
          }
        }
      }
    }
    
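    If you want to make sure the mapping has been taken into account, an optional sanity check is to retrieve it back with the standard get-mapping API:

    GET /locations/_mapping/location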

    Then, when you index your documents, you need to divide the latitudeE7 and longitudeE7 values by 10,000,000, and index each location like this:

    PUT /locations/location/1
    {
      "timestampMs": "1461820561530",
      "location": {
        "lat": -37.8107308,
        "lon": 144.965407
      },
      "accuracy": 35
    }
    
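    If you have many points, you can also (optionally) send several of the converted locations in one go with the bulk API. Here is a minimal sketch using the next two locations from your file, with arbitrary document ids (note that the sample response further down assumes only the first document was indexed):

    POST /locations/location/_bulk
    { "index": { "_id": "2" } }
    { "timestampMs": "1461820455813", "location": { "lat": -37.8107279, "lon": 144.9673809 }, "accuracy": 33 }
    { "index": { "_id": "3" } }
    { "timestampMs": "1461820281089", "location": { "lat": -37.8105184, "lon": 144.9254023 }, "accuracy": 35 }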

    Finally, your search query below...

    POST /locations/location/_search
    {
      "aggregations": {
        "zoomedInView": {
          "filter": {
            "geo_bounding_box": {
              "location": {
                "top_left": "-37, 144",
                "bottom_right": "-38, 145"
              }
            }
          },
          "aggregations": {
            "zoom1": {
              "geohash_grid": {
                "field": "location",
                "precision": 6
              },
              "aggs": {
                "ts": {
                  "date_histogram": {
                    "field": "timestampMs",
                    "interval": "15m",
                    "format": "EEE yyyy-MM-dd HH:mm"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    ...will yield the following result:

    {
      "aggregations": {
        "zoomedInView": {
          "doc_count": 1,
          "zoom1": {
            "buckets": [
              {
                "key": "r1r0fu",
                "doc_count": 1,
                "ts": {
                  "buckets": [
                    {
                      "key_as_string": "Thu 2016-04-28 05:15",
                      "key": 1461820500000,
                      "doc_count": 1
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
    
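    Note that precision 6 geohash cells are roughly 1.2km x 0.6km. If you want buckets closer to your ~100m requirement, and the raw timestamps per location rather than 15-minute buckets, one possible variation (just a sketch along the same lines, not something you strictly need) is to bump the precision to 7 (~150m x 150m cells) and add a top_hits sub-aggregation. The geohash_grid buckets come back sorted by document count, which gives you the ranked list of most common locations:

    POST /locations/location/_search
    {
      "size": 0,
      "aggregations": {
        "zoom1": {
          "geohash_grid": {
            "field": "location",
            "precision": 7
          },
          "aggs": {
            "timestamps": {
              "top_hits": {
                "size": 100,
                "_source": ["timestampMs"],
                "sort": [ { "timestampMs": "asc" } ]
              }
            }
          }
        }
      }
    }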

    UPDATE

    According to our discussion, here is a solution that could work for you. Using Logstash, you can call your API and retrieve the big JSON document (using the http_poller input), extract/transform all locations and sink them to Elasticsearch (with the elasticsearch output) very easily.

    Here is how it goes in order to format each event as depicted in my initial answer.

    1. Using http_poller you can retrieve the JSON locations (note that I've set the polling interval to 1 day, but you can change that to some other value, or simply run Logstash manually each time you want to retrieve the locations)
    2. Then we split the locations array into individual events
    3. Then we divide the latitude/longitude fields by 10,000,000 to get proper coordinates
    4. We also need to clean it up a bit by moving and removing some fields
    5. Finally, we just send each event to Elasticsearch

    Logstash configuration locations.conf:

    input {
      http_poller {
        urls => {
          get_locations => {
            method => get
            url => "http://your_api.com/locations.json"
            headers => {
              Accept => "application/json"
            }
          }
        }
        request_timeout => 60
        interval => 86400  # poll once per day (value is in seconds)
        codec => "json"
      }
    }
    filter {
      split {
        field => "locations" 
      }
      ruby {
        # divide the E7 integer coordinates by 10^7 to get decimal degrees
        # (this uses the pre-5.0 Logstash event API, i.e. event['field'])
        code => "
          event['location'] = {
            'lat' => event['locations']['latitudeE7'] / 10000000.0,
            'lon' => event['locations']['longitudeE7'] / 10000000.0
          }
        "
      }
      mutate {
        add_field => {
          "timestampMs" => "%{[locations][timestampMs]}"
          "accuracy" => "%{[locations][accuracy]}"
          "junk_i_want_to_save_but_ignore" => "%{[locations][junk_i_want_to_save_but_ignore]}"
        }
        remove_field => [
          "locations", "@timestamp", "@version" 
        ]
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "locations"
        document_type => "location"
      }
    }
    

    You can then run it with the following command:

    bin/logstash -f locations.conf
    

    When that has run, you can launch your search query and you should get what you expect.
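
    If you first want to verify that the documents actually made it into the index before running the aggregations, a quick (optional) count is enough; the number will obviously depend on how many locations are in your file:

    GET /locations/location/_count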