I'm currently working with some large datasets that include location-based information but lack direct latitude and longitude measurements, which I need in order to create visualizations.
To resolve this, I've been using geocoding APIs that take addresses or address-like information as input and return latitude and longitude as output.
I started with the Nominatim API. Unfortunately, due to the nature of my address-like data, many of my queries failed, so I switched to the Google Geocoding API. Google gave me a significantly higher success rate, but it is a paid API, which is not ideal.
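For context, here is a minimal sketch of the kind of Nominatim query involved (the sample address and User-Agent string are placeholders, and the requests library is assumed):

import requests

# query the public Nominatim search endpoint (subject to its usage policy and rate limits)
resp = requests.get(
    'https://nominatim.openstreetmap.org/search',
    params={'q': '1600 Pennsylvania Ave NW, Washington DC', 'format': 'json', 'limit': 1},
    headers={'User-Agent': 'my-geocoding-script/0.1'},  # Nominatim asks for an identifying User-Agent
)
results = resp.json()
if results:
    # Nominatim returns lat/lon as strings in each result
    lat, lon = float(results[0]['lat']), float(results[0]['lon'])
    print(lat, lon)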
I realize that, given Google's incredible resources, it would be virtually impossible to build a system that rivals their geocoding API within a reasonable amount of time, but it has made me wonder what's going on under the hood.
Is a BERT-like translational system at work? What happens to the text after it's sent off?
I'm using n-grams for a similar purpose, by building an index and an inverted index. See the ngram package:
import csv
import ngram

ind = {}  # per-country n-gram index over address strings
inv = {}  # per-country inverted index: address -> (coordinates, original address)

...  # open each per-country CSV here, yielding filename and stream

country = filename.replace('.csv', '')
ind[country] = ngram.NGram()
inv[country] = {}
s_csv = csv.reader(stream, delimiter=';')
next(s_csv)  # skip the header row
for row in s_csv:
    coord = tuple(map(float, row[0:2]))  # (latitude, longitude)
    address = ' '.join(row[2:])
    ad = address.lower()  # normalised key used for fuzzy matching
    ind[country].add(ad)
    inv[country][ad] = (coord, address)
Then you can use the find function:
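A minimal lookup sketch (the query string and the 'france' key are hypothetical; NGram.find returns the best match above a similarity threshold, or None):

# fuzzy-match a hypothetical query against the index built above
query = '12 rue de rivoli paris'.lower()
match = ind['france'].find(query)  # best match, or None if nothing is similar enough
if match is not None:
    coord, address = inv['france'][match]
    print(coord, address)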
Take care of the memory consumption: roughly 16 GB of RAM for a country like France with OSM data.
To see an implementation of this approach, check the OpenGeoCode HTTP API Service source code.