R: How to correctly prepare an international address for ggmap geocode / Google Geocode API?


I've found that

loc <- "Dradenaustraße 33, 21129 Hamburg"
library(ggmap)
geocode(loc, source = "google", force = TRUE, messaging = TRUE, output = "more")

returns either NAs or a "400, Bad Request" error. If I try to prepare the address as shown below, it even returns a lat/lon somewhere in Kansas.

I've also found that

loc <- "Dradenaustraße 33, 21129 Hamburg"
Encoding(loc) <- "UTF-8"
loc <- URLencode(loc, reserved = TRUE)

returns

Warning message:
In strsplit(URL, "") : input string 1 is invalid UTF-8

and loc will be NA afterwards.

By the way, both of the following work fine with geocode, i.e. they return the correct address and lat/lon:

loc <- "Dradenaustrasse 33, 21129 Hamburg" #manually reformatted
loc <- "Dradenaustraee 33, 21129 Hamburg" #misspelled

The following misspelled address has the same problems as the initial normal spelling:

loc <- "Dradenaustraée 33, 21129 Hamburg" #misspelled

I'm calling the geocode API with many thousands of addresses like the one above and don't want to reformat them (i.e. replace "ß" with "ss") unless this is absolutely necessary. In that case, I would have to make assumptions about many other international addresses containing accents (`, ´, etc.) as well.

Any ideas?

Many thanks! :)

Edited to point out that I'm looking for a solution that works for arbitrary international addresses and doesn't require domain-specific knowledge or manual reformatting of addresses.


Solution

  • This is an encoding problem, and those are famously tricky. Your original text is not in UTF-8, which is what Google expects. Setting the encoding only attaches metadata to the string; it does not change the underlying bytes. This:

    Encoding(loc) <- "UTF-8"
    

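    tells R that the string is UTF-8, but the bytes are not valid UTF-8, which is what the warning is complaining about. Your German text is probably in Latin-9 (ISO-8859-15) or a similar single-byte encoding, which Encoding() cannot declare: it only accepts "latin1", "UTF-8", "bytes" and "unknown".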
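
    To make the difference between relabelling and converting concrete, here is a minimal sketch in base R. It builds the address by hand with "ß" stored as the single byte 0xDF; that is an assumption standing in for your data, since the real source encoding is not known from the question.

    ```r
    # Assumption: the data stores "ß" as the single Latin-1/Latin-9 byte 0xDF.
    loc_raw <- c(charToRaw("Dradenaustra"), as.raw(0xdf),
                 charToRaw("e 33, 21129 Hamburg"))
    loc_latin1 <- rawToChar(loc_raw)

    # Relabelling only changes the declared encoding, not the bytes,
    # so the string is still not valid UTF-8.
    relabelled <- loc_latin1
    Encoding(relabelled) <- "UTF-8"
    validUTF8(relabelled)   # FALSE

    # iconv() rewrites the bytes: 0xDF becomes the two-byte sequence C3 9F.
    converted <- iconv(loc_latin1, from = "latin1", to = "UTF-8")
    validUTF8(converted)    # TRUE
    ```

    If the data turns out to be Latin-9, `from = "ISO-8859-15"` would be the name to try instead; "ß" has the same byte value in both.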

    What you can do is convert the text to UTF-8 before sending it to Google. You can do this inline before each call, for example:

    library(ggmap)
    loc <- "Dradenaustraße 33, 21129 Hamburg"
    # Assumption: the source bytes are Latin-1; use from = "ISO-8859-15" for Latin-9
    utf_encoded <- iconv(loc, from = "latin1", to = "UTF-8")
    geocode(utf_encoded, source = "google", force = TRUE, messaging = TRUE, output = "more")
    

    Or you can build a second data store (files, database tables, etc.) once: read the German text from its file or database, run all of it through the conversion, and write the result back out as UTF-8. The geocoding step then only ever sees UTF-8 input.
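
    The one-off conversion could look roughly like the sketch below. The file names are hypothetical temporary files, the sample line is written by the sketch itself so that it is self-contained, and the source encoding is again assumed to be Latin-1.

    ```r
    # Hypothetical input and output files; replace with your own paths.
    in_file  <- tempfile(fileext = ".txt")
    out_file <- tempfile(fileext = ".txt")

    # Write one Latin-1 encoded sample address ("ß" as byte 0xDF).
    sample_bytes <- c(charToRaw("Dradenaustra"), as.raw(0xdf),
                      charToRaw("e 33, 21129 Hamburg\n"))
    writeBin(sample_bytes, in_file)

    # Read the lines, convert every element to UTF-8, write a UTF-8 copy.
    addresses      <- readLines(in_file, warn = FALSE)
    addresses_utf8 <- iconv(addresses, from = "latin1", to = "UTF-8")
    writeLines(addresses_utf8, out_file, useBytes = TRUE)

    # Inspect the bytes that were actually written.
    out_bytes <- readBin(out_file, what = "raw", n = file.size(out_file))
    ```

    `iconv()` is vectorised, so the same call should handle many thousands of addresses at once; elements that cannot be converted come back as NA, which is worth checking before geocoding.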

    Either way, the stringi package offers string transformation / transliteration functions. The relevant part of its documentation looks to be:

    stri_trans_general("groß", "upper")

    ## "GROSS"
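
    Once the string is valid UTF-8, the URLencode() step from the question should no longer produce the "invalid UTF-8" warning or NA. A minimal base R sketch, using a `\u00df` escape so the example does not depend on how this file itself is encoded:

    ```r
    # "\u00df" is "ß"; R marks a string containing \u escapes as UTF-8.
    loc_utf8 <- "Dradenaustra\u00dfe 33, 21129 Hamburg"

    # Percent-encode for use in a URL; "ß" is encoded byte by byte as %C3%9F.
    encoded <- URLencode(loc_utf8, reserved = TRUE)
    encoded
    ```

    Whether geocode() needs the pre-encoded form or encodes the address itself is something to check against your ggmap version; passing the plain UTF-8 string is the safer first attempt.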