Tags: python, pandas, dataframe, indexing, country-codes

How to use `pycountry.db.Country` objects as a `pd.DataFrame` index?


I am creating a dataset collecting data for a given set of countries. To avoid any ambiguity, I would like to use a pycountry.db.Country object to represent each country.

However, once the country is set as the index of my pd.DataFrame, I can't select a record with .loc[] by passing a country. I get this kind of error even though the record exists:

raise KeyError(f"None of [{key}] are in the [{axis_name}]")

How can I select a record in my pd.DataFrame, given a pycountry.db.Country object?

Here is a minimal example that reproduces the problem:

import pandas as pd
import pycountry

aruba: pycountry.db.Country = pycountry.countries.get(alpha_3="ABW")
belgium: pycountry.db.Country = pycountry.countries.get(alpha_3="BEL")
canada: pycountry.db.Country = pycountry.countries.get(alpha_3="CAN")

data: list[dict] = [
    {"country": aruba, "population": 106_203},
    {"country": belgium, "population": 11_429_336},
    {"country": canada, "population": 37_058_856},
]

df: pd.DataFrame = pd.DataFrame(data)
df.set_index("country", inplace=True)
# df.index = df.index.astype(dtype="category")  # optional: doesn't change the outcome

assert df.index[1] == belgium
assert df.index[1] is belgium

belgium_data = df.loc[belgium]  # <-- fails with "None of [Index([('alpha_2', 'BE'),\n('alpha_3', 'BEL'),\n('flag', '🇧🇪'),\n('name', 'Belgium'),\n('numeric', '056'),\n('official_name', 'Kingdom of Belgium')],\ndtype='object', name='country')] are in the [index]"

Solution

    Explanation

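    pycountry's Country class defines __iter__, so pandas classifies a Country object as list-like rather than as a scalar. When you pass it to .loc, pandas iterates over it and looks up each (field, value) pair it yields as a separate label, which is why the KeyError lists tuples such as ('alpha_2', 'BE').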

    >>> from pandas.api.types import is_list_like, is_scalar
    >>> is_scalar(belgium)
    False
    >>> is_list_like(belgium)
    True
    

    See "What datatype is considered 'list-like' in Python?" for more about is_list_like.
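    To make the explanation concrete, here is a minimal sketch that does not need pycountry. IterableRecord is a hypothetical stand-in that, like Country, yields (field, value) pairs when iterated; it is not pycountry's actual implementation. It shows that pandas classifies such an object as list-like, and which tuples a label lookup would then treat as individual keys.

```python
from pandas.api.types import is_list_like


class IterableRecord:
    """Hypothetical stand-in for Country: iterating yields (field, value) pairs."""

    def __init__(self, **fields):
        self._fields = fields

    def __iter__(self):
        return iter(self._fields.items())


record = IterableRecord(alpha_3="BEL", name="Belgium")

# The object exposes __iter__, so pandas classifies it as list-like.
looks_list_like = is_list_like(record)

# These tuples are what a label lookup would treat as separate keys.
candidate_labels = list(record)

print(looks_list_like)
print(candidate_labels)
```

    The tuples in candidate_labels have the same shape as the ones listed in the KeyError from the question.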

    Workaround

    Interestingly, wrapping the key in a list works, because pandas then treats the Country as a single label:

    >>> df.loc[[belgium]].iloc[0]
    population    11429336
    Name: Country(alpha_2='BE', alpha_3='BEL', flag='🇧🇪', name='Belgium', numeric='056', official_name='Kingdom of Belgium'), dtype: int64
    

    So if you really want to use the object as the index, you can wrap it in a list and take the first row of the result.

    Or, more drastically, you can make the object look non-iterable:

    >>> belgium.__iter__ = None
    >>> df.loc[belgium]
    population    11429336
    Name: Country(__iter__=None, alpha_2='BE', alpha_3='BEL', flag='🇧🇪', name='Belgium', numeric='056', official_name='Kingdom of Belgium'), dtype: int64
    

    But I suspect this would break other functionality in your code: __iter__ seems to be implemented on Country so that it can easily be cast to a dict, and, as the repr above shows, the assignment also adds a spurious __iter__ field to the object.

    Recommendation

    Using objects as an index is probably not the best practice, especially if your dataset is large. I would recommend using a plain key such as alpha_3 as the index instead and keeping the object in a separate column. You still avoid ambiguity, but you don't run into trouble with overly complex index types.

    data: list[dict] = [
        {"index": aruba.alpha_3, "country": aruba, "population": 106_203},
        {"index": belgium.alpha_3, "country": belgium, "population": 11_429_336},
        {"index": canada.alpha_3, "country": canada, "population": 37_058_856},
    ]
    
    df: pd.DataFrame = pd.DataFrame(data)
    df.set_index("index", inplace=True)
    
    assert df.loc[belgium.alpha_3]["country"] == belgium
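    If you still want to pass country objects around in the rest of your code, one option is a small helper that translates the object into its alpha_3 key before the lookup. The sketch below is only an illustration: row_for is a hypothetical helper name, and SimpleNamespace objects stand in for pycountry.db.Country so that the snippet runs without pycountry installed. Only the alpha_3 attribute is assumed.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for pycountry.db.Country; only .alpha_3 is relied on.
belgium = SimpleNamespace(alpha_3="BEL", name="Belgium")
canada = SimpleNamespace(alpha_3="CAN", name="Canada")

# Index by the alpha_3 string and keep the object in an ordinary column.
df = pd.DataFrame(
    {"country": [belgium, canada], "population": [11_429_336, 37_058_856]},
    index=pd.Index([belgium.alpha_3, canada.alpha_3], name="alpha_3"),
)


def row_for(frame, country):
    """Look up a row by the country's alpha_3 code rather than by the object."""
    return frame.loc[country.alpha_3]


belgium_population = int(row_for(df, belgium)["population"])
canada_name = row_for(df, canada)["country"].name
```

    Callers keep working with country objects, while the index itself only ever contains plain strings.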