Return pandas column base on country short code with Geopy

I have a dataframe with about 100k rows and I would like to find the coordinates based on the country short code with geopy.

The original dataframe look like this:

index	country	city
0	CL	SANTIAGO
1	BR	SAO PAULO
2	BR	ITUPEVA

To limit the number of requests I create another dataframe with unique "country" value, then look up location details from Geopy with following code:

from geopy.geocoders import Nominatim    
from geopy.extra.rate_limiter import RateLimiter    
country_list = raw_df.country.unique().tolist()    
short_code_df["location"] = short_code_df["short_code"].apply(geocode,language="en")

Which return me the below df:

| index | short_code |                   location                   |
|-------|------------|----------------------------------------------|
|     0 | CL         | (Chile, (-31.7613365, -71.3187697))          |
|     1 | BR         | (Brazil, (-10.3333333, -53.2))               |
|     2 | US         | (United States, (39.7837304, -100.445882))   |
|     3 | GB         | (United Kingdom, (54.7023545, -3.2765753))   |
|     4 | JP         | (Japan, (36.5748441, 139.2394179))           |
|     5 | CH         | (Switzerland, (46.7985624, 8.2319736))       |
|     6 | CN         | (China, (35.000074, 104.999927))             |
|     7 | HK         | (Hong Kong, China, (22.350627, 114.1849161)) |

What I would like is to return the Country, Lat, Lon in their respective column like below:

| index | short_code |      location        |      lat       |        lon     |
|-------|------------|----------------------|----------------|----------------|
|     0 | CL         | Chile                |    -31.7613365 |   - 71.3187697 |
|     1 | BR         | Brazil               |    -10.3333333 |          -53.2 |
|     2 | US         | United States        |     39.7837304 |     100.445882 |

I tried to slice it with a square bracket at the end but it throw me an error.

I also tried to use the pandas explode function but that was not working too.

Thanks first and appreciate your helps.

Solution

You can use Series.apply to solve this problem. The approach is slightly different depending on the type of location: does the column contain tuples or strings ?

If `location` contains tuples (tuple[str, tuple[float, float]):


import pandas as pd

df = pd.DataFrame(
    {
        "short_code": ["CL", "BR", "US"],
        "location": [
        ("Chile", (-31.7613365, -71.3187697)),
        ("Brazil", (-10.3333333, -53.2)),
        ("United States", (39.7837304, -100.445882))
    ]
    }
)

Using apply:


df["country"] = df["location"].apply(lambda x: x[0])
df["lat"] = df["location"].apply(lambda x: x[1][0])
df["lon"] = df["location"].apply(lambda x: x[1][1])

you get:

If location is a string

df = pd.DataFrame(
    {
        "short_code": ["CL", "BR", "US"],
        "location": [
        "(Chile, (-31.7613365, -71.3187697))",
        "(Brazil, (-10.3333333, -53.2))",
        "(United States, (39.7837304, -100.445882))"
    ]
    }
)

In this case you can first use regular expression to parse location and transform it as tuples, then apply the same steps as above

import re
import ast


def parse_location_str(loc_str: str):
    pattern = re.compile(r'\((?P<country>.*?),(?P<coord>.*)\)')
    m = pattern.search(loc_str)
    if m is None:
        return None
    country = m.group('country')
    coords = ast.literal_eval(m.group('coord'))
    return (country, coords)

df['loc_parsed'] = df['location'].apply(parse_location_str)
df["country"] = df["loc_parsed"].apply(lambda x: x[0])
df["lat"] = df["loc_parsed"].apply(lambda x: x[1][0])
df["lon"] = df["loc_parsed"].apply(lambda x: x[1][1])

You get:

Return pandas column base on country short code with Geopy

If location contains tuples (tuple[str, tuple[float, float]):

If location is a string

If `location` contains tuples (tuple[str, tuple[float, float]):