I have a dataframe with about 100k rows and I would like to find the coordinates based on the country short code with geopy.
The original dataframe look like this:
index | country | city |
---|---|---|
0 | CL | SANTIAGO |
1 | BR | SAO PAULO |
2 | BR | ITUPEVA |
To limit the number of requests I create another dataframe with unique "country" value, then look up location details from Geopy with following code:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
country_list = raw_df.country.unique().tolist()
short_code_df["location"] = short_code_df["short_code"].apply(geocode,language="en")
Which return me the below df:
| index | short_code | location |
|-------|------------|----------------------------------------------|
| 0 | CL | (Chile, (-31.7613365, -71.3187697)) |
| 1 | BR | (Brazil, (-10.3333333, -53.2)) |
| 2 | US | (United States, (39.7837304, -100.445882)) |
| 3 | GB | (United Kingdom, (54.7023545, -3.2765753)) |
| 4 | JP | (Japan, (36.5748441, 139.2394179)) |
| 5 | CH | (Switzerland, (46.7985624, 8.2319736)) |
| 6 | CN | (China, (35.000074, 104.999927)) |
| 7 | HK | (Hong Kong, China, (22.350627, 114.1849161)) |
What I would like is to return the Country, Lat, Lon in their respective column like below:
| index | short_code | location | lat | lon |
|-------|------------|----------------------|----------------|----------------|
| 0 | CL | Chile | -31.7613365 | - 71.3187697 |
| 1 | BR | Brazil | -10.3333333 | -53.2 |
| 2 | US | United States | 39.7837304 | 100.445882 |
I tried to slice it with a square bracket at the end but it throw me an error.
I also tried to use the pandas explode function but that was not working too.
Thanks first and appreciate your helps.
You can use Series.apply
to solve this problem. The approach is slightly different depending on the type of location
: does the column contain tuples or strings ?
location
contains tuples (tuple[str, tuple[float, float]):
import pandas as pd
df = pd.DataFrame(
{
"short_code": ["CL", "BR", "US"],
"location": [
("Chile", (-31.7613365, -71.3187697)),
("Brazil", (-10.3333333, -53.2)),
("United States", (39.7837304, -100.445882))
]
}
)
Using apply:
df["country"] = df["location"].apply(lambda x: x[0])
df["lat"] = df["location"].apply(lambda x: x[1][0])
df["lon"] = df["location"].apply(lambda x: x[1][1])
df = pd.DataFrame(
{
"short_code": ["CL", "BR", "US"],
"location": [
"(Chile, (-31.7613365, -71.3187697))",
"(Brazil, (-10.3333333, -53.2))",
"(United States, (39.7837304, -100.445882))"
]
}
)
In this case you can first use regular expression to parse location
and transform it as tuples, then apply the same steps as above
import re
import ast
def parse_location_str(loc_str: str):
pattern = re.compile(r'\((?P<country>.*?),(?P<coord>.*)\)')
m = pattern.search(loc_str)
if m is None:
return None
country = m.group('country')
coords = ast.literal_eval(m.group('coord'))
return (country, coords)
df['loc_parsed'] = df['location'].apply(parse_location_str)
df["country"] = df["loc_parsed"].apply(lambda x: x[0])
df["lat"] = df["loc_parsed"].apply(lambda x: x[1][0])
df["lon"] = df["loc_parsed"].apply(lambda x: x[1][1])