I am creating a dataset that collects data for a given set of countries. To avoid any ambiguity, I would like to use a pycountry.db.Country object to represent each country.
However, when I set the country as the index of my pd.DataFrame, I can't select a record with .loc[] by passing a country. I get this type of error, even though the record exists:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
How can I select a record in my pd.DataFrame, given a pycountry.db.Country object?
Here is a minimal example that reproduces the problem:
import pandas as pd
import pycountry
aruba: pycountry.db.Country = pycountry.countries.get(alpha_3="ABW")
belgium: pycountry.db.Country = pycountry.countries.get(alpha_3="BEL")
canada: pycountry.db.Country = pycountry.countries.get(alpha_3="CAN")
data: list[dict] = [
    {"country": aruba, "population": 106_203},
    {"country": belgium, "population": 11_429_336},
    {"country": canada, "population": 37_058_856},
]
df: pd.DataFrame = pd.DataFrame(data)
df.set_index("country", inplace=True)
# df.index = df.index.astype(dtype="category") # optional: doesn't change the outcome
assert df.index[1] == belgium
assert df.index[1] is belgium
belgium_data = df.loc[belgium] # <-- fails with "None of [Index([('alpha_2', 'BE'),\n('alpha_3', 'BEL'),\n('flag', '🇧🇪'),\n('name', 'Belgium'),\n('numeric', '056'),\n('official_name', 'Kingdom of Belgium')],\ndtype='object', name='country')] are in the [index]"
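Pandas treats your Country object as list-like because it defines __iter__. That is why you cannot pass it directly as a key to .loc: pandas iterates over it and looks up each yielded item, here the (field, value) tuples shown in the error message, as a separate label in the index.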
>>> from pandas.core.dtypes.common import is_list_like, is_scalar
>>> is_scalar(belgium)
False
>>> is_list_like(belgium)
True
See "What datatype is considered 'list-like' in Python?" for more about is_list_like.
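To make the list-like check concrete, here is a minimal sketch that does not need pycountry. IterableRecord and PlainRecord are hypothetical stand-ins I made up for illustration; IterableRecord is only assumed to mimic how Country exposes its fields through __iter__.

```python
from pandas.api.types import is_list_like, is_scalar


class IterableRecord:
    """Hypothetical stand-in for Country: yields (field, value) pairs when iterated."""

    def __init__(self, **fields):
        self._fields = fields

    def __iter__(self):
        return iter(self._fields.items())


class PlainRecord:
    """Same data, but without __iter__."""

    def __init__(self, **fields):
        self._fields = fields


iterable_record = IterableRecord(alpha_3="BEL", name="Belgium")
plain_record = PlainRecord(alpha_3="BEL", name="Belgium")

# Neither object is a scalar, but only the iterable one counts as list-like.
iterable_is_scalar = is_scalar(iterable_record)
iterable_is_list_like = is_list_like(iterable_record)
plain_is_list_like = is_list_like(plain_record)

# These are the items a list-like key would be expanded into.
expanded_labels = list(iterable_record)
```

The expanded labels are (field, value) tuples, which matches the tuples listed in the KeyError message from the question.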
Interestingly, this works:
>>> df.loc[[belgium]].iloc[0]
population 11429336
Name: Country(alpha_2='BE', alpha_3='BEL', flag='🇧🇪', name='Belgium', numeric='056', official_name='Kingdom of Belgium'), dtype: int64
So if you really want to use the object as an index, you can work around the problem by wrapping the key in a list.
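If you need this lookup in several places, the list-wrapping trick can be put in a small helper. This is only a sketch: loc_single is a hypothetical name, and StandInCountry is a made-up class assumed to behave like Country (iterable, hashed by identity) so that the snippet runs without pycountry.

```python
import pandas as pd


class StandInCountry:
    """Hypothetical stand-in for pycountry.db.Country (iterable, hashed by identity)."""

    def __init__(self, **fields):
        self._fields = fields

    def __iter__(self):
        return iter(self._fields.items())

    def __repr__(self):
        return f"StandInCountry({self._fields['alpha_3']!r})"


def loc_single(frame: pd.DataFrame, label) -> pd.Series:
    """Return the single row whose index label is `label`, even if `label` is iterable."""
    rows = frame.loc[[label]]  # a one-element list makes pandas look up `label` itself
    if len(rows) != 1:
        raise KeyError(f"expected exactly one row for {label!r}, found {len(rows)}")
    return rows.iloc[0]


aruba = StandInCountry(alpha_3="ABW")
belgium = StandInCountry(alpha_3="BEL")

frame = pd.DataFrame(
    [
        {"country": aruba, "population": 106_203},
        {"country": belgium, "population": 11_429_336},
    ]
).set_index("country")

belgium_row = loc_single(frame, belgium)
belgium_population = belgium_row["population"]
```

The length check is there because .loc with a list returns a DataFrame, so a duplicated label would otherwise silently return only its first row.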
Or, getting even more ridiculous, you can make the object non-iterable:
>>> belgium.__iter__ = None
>>> df.loc[belgium]
population 11429336
Name: Country(__iter__=None, alpha_2='BE', alpha_3='BEL', flag='🇧🇪', name='Belgium', numeric='056', official_name='Kingdom of Belgium'), dtype: int64
But I'm sure this would break some other functionality in your code, since __iter__ seems to be implemented on Country to make it easy to cast it to a dict.
Using objects as an index is probably not the best practice if your dataset is large. I would recommend using, for example, alpha_3 as the index instead, and keeping the object in a separate column. You would still avoid ambiguity, without running into trouble with overly complex index types.
data: list[dict] = [
    {"index": aruba.alpha_3, "country": aruba, "population": 106_203},
    {"index": belgium.alpha_3, "country": belgium, "population": 11_429_336},
    {"index": canada.alpha_3, "country": canada, "population": 37_058_856},
]
df: pd.DataFrame = pd.DataFrame(data)
df.set_index("index", inplace=True)
assert df.loc[belgium.alpha_3]["country"] == belgium
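With a plain string index, the usual .loc patterns work without any workaround. The sketch below shows a few of them; StandInCountry is again a hypothetical stand-in assumed to behave like Country, used only so the snippet runs without pycountry, and the "alpha_3" index name is my own choice.

```python
import pandas as pd


class StandInCountry:
    """Hypothetical stand-in for pycountry.db.Country with an alpha_3 attribute."""

    def __init__(self, alpha_3: str, name: str):
        self.alpha_3 = alpha_3
        self.name = name

    def __iter__(self):
        # Mimics the iterable behaviour that confuses .loc when used as a key.
        return iter({"alpha_3": self.alpha_3, "name": self.name}.items())


aruba = StandInCountry("ABW", "Aruba")
belgium = StandInCountry("BEL", "Belgium")
canada = StandInCountry("CAN", "Canada")

frame = pd.DataFrame(
    [
        {"alpha_3": aruba.alpha_3, "country": aruba, "population": 106_203},
        {"alpha_3": belgium.alpha_3, "country": belgium, "population": 11_429_336},
        {"alpha_3": canada.alpha_3, "country": canada, "population": 37_058_856},
    ]
).set_index("alpha_3")

# Single value: row label and column label in one .loc call.
belgium_population = frame.loc[belgium.alpha_3, "population"]

# Several countries at once: pass a list of codes.
two_populations = frame.loc[[aruba.alpha_3, canada.alpha_3], "population"].tolist()

# The original object is still available from its column.
belgium_object = frame.loc["BEL", "country"]
```

Because the index holds only strings, the country objects never reach pandas' key handling, so their __iter__ method no longer matters for selection.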