I'm currently working with some large datasets that include location-based information but lack direct latitude and longitude measurements, which I need in order to create visualizations.
To resolve this, I've been using geocoding APIs that take addresses or address-like information as input and return latitude and longitude as output.
I started with the Nominatim API. Unfortunately, due to the nature of my address-like data, many of my queries failed, so I switched to the Google Geocoding API. Google gave me a significantly higher success rate, but it is a paid API, which is not ideal.
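For context, here is a minimal sketch of the kind of Nominatim query involved (the sample address and User-Agent string are placeholders, and the requests library is assumed):

import requests

# query the public Nominatim search endpoint (subject to its usage policy and rate limits)
resp = requests.get(
    'https://nominatim.openstreetmap.org/search',
    params={'q': '1600 Pennsylvania Ave NW, Washington DC', 'format': 'json', 'limit': 1},
    headers={'User-Agent': 'my-geocoding-script/0.1'},  # Nominatim asks for an identifying User-Agent
)
results = resp.json()
if results:
    # Nominatim returns lat/lon as strings in each result
    lat, lon = float(results[0]['lat']), float(results[0]['lon'])
    print(lat, lon)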
I realize that, given Google's incredible resources, it would be virtually impossible to build a system that rivals their geocoding API within a reasonable amount of time, but it has made me wonder what's going on under the hood.
Is a BERT-like translational system at work? What happens to the text after it's sent off?
I'm using n-grams for a similar purpose, by building an index and an inverted index. See the ngram package:
import csv
import ngram

ind = {}  # per-country n-gram index over address strings
inv = {}  # per-country inverted index: address -> (coordinates, original address)

...  # open each per-country CSV here, yielding filename and stream

country = filename.replace('.csv', '')
ind[country] = ngram.NGram()
inv[country] = {}
s_csv = csv.reader(stream, delimiter=';')
next(s_csv)  # skip the header row
for row in s_csv:
    coord = tuple(map(float, row[0:2]))  # (latitude, longitude)
    address = ' '.join(row[2:])
    ad = address.lower()  # normalised key used for fuzzy matching
    ind[country].add(ad)
    inv[country][ad] = (coord, address)
Then you can use the find function:
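A minimal lookup sketch (the query string and the 'france' key are hypothetical; NGram.find returns the best match above a similarity threshold, or None):

# fuzzy-match a hypothetical query against the index built above
query = '12 rue de rivoli paris'.lower()
match = ind['france'].find(query)  # best match, or None if nothing is similar enough
if match is not None:
    coord, address = inv['france'][match]
    print(coord, address)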
Take care of the memory consumption: roughly 16 GB of RAM for a country like France with OSM data.
To see an implementation of this approach, check the OpenGeoCode HTTP API Service source code.