Tags: azure-cosmosdb, sharding, azure-cosmosdb-mongoapi

Azure CosmosDB - Sharding Strategy for georeferenced data?


I have a collection on Cosmos DB with about 1 million documents like this:

{
   "title":"My Happy Bakery",
   "address":"155 Happy Avenue, Happy City",
   "location" : {
        "coordinates" : [
            -46.66357421875,
            -23.6149730682373
        ],
        "type" : "Point"
    }
}

Some extra points:

  1. The search pattern is based on a point and a radius on the location field, like "Restaurants near me" (see the query sketch after this list)
  2. We are talking about restaurants and markets, so they are clustered on metropolitan areas... they don't occur in many places
  3. Some cities will have many docs (like 70k) and others very few (like 10)
  4. Zip code has been deemed not useful here because I can't filter by it... I can't determine which zip codes fall inside an X km radius circle...
  5. City is also a problem because here in Brazil a lot of cities are "continuous", meaning a 1 km radius search will hit a lot of documents located in different cities/counties
  6. State is not a good choice because there's a massive concentration of documents in 4-5 states. There's also the challenge of determining the state you're in based on your location... not impossible, but hard...
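
For reference, the "point + radius" pattern from item 1 looks roughly like this with the MongoDB C# driver against the Mongo API. This is only a sketch: the database/collection names, the 1 km radius, and the result limit are made up, and the coordinates reuse the sample document's.

    using MongoDB.Bson;
    using MongoDB.Driver;
    using MongoDB.Driver.GeoJsonObjectModel;

    var client = new MongoClient("<cosmos-connection-string>");
    var places = client.GetDatabase("placesdb").GetCollection<BsonDocument>("places");

    // "Restaurants near me": a point plus a 1 km radius, served by the 2dsphere index on "location"
    var here = GeoJson.Point(GeoJson.Geographic(-46.66357421875, -23.6149730682373));
    var nearby = Builders<BsonDocument>.Filter.NearSphere("location", here, maxDistance: 1000);

    var results = places.Find(nearby).Limit(50).ToList();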

What would be a good (or ideal) sharding strategy for this scenario? This is my first time sharding this kind of scenario so I'm kinda lost...

Edit 1:
Added extra points 3, 4, 5 and 6.


Solution

  • I ended up using the distance from the equator in meters, divided by 250, as the shard key. I used this function for the distance calculation:

    // Distance in meters from the equator to the given latitude (haversine on a spherical Earth)
    public static double Calculate(double lat)
    {
        double rad(double angle) => angle * 0.017453292519943295769236907684886127d; // = angle * Math.PI / 180.0d
        double havf(double diff) => Math.Pow(Math.Sin(rad(diff) / 2d), 2); // = sin²(rad(diff) / 2)
        return 12745600 * Math.Asin(Math.Sqrt(havf(lat))); // Earth radius 6,372.8 km x 2 = 12,745.6 km = 12,745,600 m
    }
    

    At query time, we saw our request time drop from 2-3 seconds to 140 ms. Prior to this sharding strategy we had a geo index, which helped a lot, but the sharding changed everything. By the way, this is valid for our collection, where documents are around 1 KB and 400 RUs are allocated.
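
    To make the idea concrete, here is a sketch of how the key can be applied. The field name "distanceBand" and the helper names are my own placeholders (not necessarily what we use in production), and the code assumes it sits in the same class as the Calculate method above.

        using System;
        using System.Collections.Generic;

        // Each document gets the band of its own latitude when it is written:
        // band = floor(Calculate(lat) / 250), roughly 10506 for the sample bakery's latitude.
        public static long Band(double lat) => (long)Math.Floor(Calculate(lat) / 250d);

        // At query time, "point + radius" maps to the small set of bands the search circle
        // can touch, so the geo query only fans out to those physical partitions.
        public static IEnumerable<long> BandsFor(double lat, double radiusMeters)
        {
            double d = Calculate(lat);
            long first = (long)Math.Floor(Math.Max(0d, d - radiusMeters) / 250d);
            long last = (long)Math.Floor((d + radiusMeters) / 250d);
            for (long b = first; b <= last; b++) yield return b;
        }

        // e.g. combine with the radius filter from the question:
        //   Builders<BsonDocument>.Filter.In("distanceBand", BandsFor(lat, 1000)) & nearSphereFilter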