Search code examples
pythonpandasdataframegroup-concatisinstance

Pandas isin() output to string and general code optimisation


I am just starting to use python to do some analysis of data at work, so I could really use some help here :)

I have a df with African countries and a bunch of indicators and another df with dimensions representing groupings, and if a country is in that group, the name of the country is in there.

Here's a snapshot:

# indicators df
df_indicators= pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'], 
                   'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
# groupings df
df_groupings= pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'], 
                   'SSA': ['Angola', 'Benin', 'South Africa']})
# what I would like to have without doing it manually
df_indicators['Fragile'] = ['Not Fragile', 'Fragile', 'Fragile']
df_indicators['SSA'] = ['Not SSA', 'SSA', 'Not SSA']

df_indicators

and want to add dimensions to this df that tell me if the country is a fragile state or not (and other groupings). So I have another df with the list of countries belonging to that category.

I use the isin instance to check for equality, but what I would really like is that instead of TRUE and FALSE in the new dimension "Fragile" for example, TRUE values would be replaced by "Fragile" and FALSE values by "NOT FRAGILE".

It goes without saying that if you seen any way to improve this code I am very eager to learn from professionals! Especially if you are in the domain of sustainable development goal statistics.

import pandas as pd
import numpy as np
excel_file = 'Downloads/BR_Data.xlsx'
indicators = pd.read_excel(excel_file, sheetname="Indicators", header=1)
groupings = pd.read_excel(excel_file, sheetname="Groupings", header=0)

# Title countries in the Sub-Saharan Africa dimension

decap = groupings["Sub-Saharan Africa (World Bank List)"].str.title()
groupings["Sub-Saharan Africa (World Bank List)"] = decap

# Create list of legal country names

legal_tags = {"Côte d’Ivoire":"Ivory Coast", "Cote D'Ivoire":"Ivory Coast", "Democratic Republic of the Congo":"DR Congo", "Congo, Dem. Rep.":"DR Congo", 
              "Congo, Repub. of the":"DR Congo", "Congo, Rep.": "DR Congo", "Dr Congo": "DR Congo", "Central African Rep.":"Central African Republic", "Sao Tome & Principe":
              "Sao Tome and Principe", "Gambia, The":"Gambia"}

# I am sure there is a way to insert a list of the column names instead of copy pasting the name of every column label 5 times

groupings.replace({"Least Developing Countries in Africa (UN Classification, used by WB)" : legal_tags}, inplace = True)
groupings.replace({"Oil Exporters (McKinsey Global Institute)" : legal_tags}, inplace = True)
groupings.replace({"Sub-Saharan Africa (World Bank List)" : legal_tags}, inplace = True)
groupings.replace({"African Fragile and Conflict Affected Aread (OECD)" : legal_tags}, inplace = True)
groupings

# If the country is df indicator is found in grouping df then assign true to new column [LDC] => CAN I REPLACE TRUE WITH "LDC" etc...? 

indicators["LDC"] = indicators["Country"].isin(groupings["Least Developing Countries in Africa (UN Classification, used by WB)"])
indicators["Fragile"] = indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"])
indicators["Oil"] = indicators["Country"].isin(groupings["Oil Exporters (McKinsey Global Institute)"])
indicators["SSA"] = indicators["Country"].isin(groupings["Sub-Saharan Africa (World Bank List)"])
indicators["Landlock"] = indicators["Country"].isin(groupings['Landlocked (UNCTAD List)'])

# When I concatenate the data frames of the groupings average I get an index with a bunch of true, false, true, false etc...
df = indicators.merge(groupings, left_on = "Country", right_on= "Country", how ="right")
labels = ['African Fragile and Conflict Affected Aread (OECD)', 'Sub-Saharan Africa (World Bank List)', 'Landlocked (UNCTAD List)', 'North Africa (excl. Middle east)', 'Oil Exporters (McKinsey Global Institute)', 'Least Developing Countries in Africa (UN Classification, used by WB)']
df.drop(labels, axis = 1, inplace = True)
df.loc['mean'] = df.mean()
df_regions = df[:].groupby('Regional Group').mean()
df_LDC = df[:].groupby('LDC').mean()
df_Oil = df[:].groupby('Oil').mean()
df_SSA = df[:].groupby('SSA').mean()
df_landlock = df[:].groupby('Landlock').mean()
df_fragile = df[:].groupby('Fragile').mean()

frames = [df_regions, df_Oil, df_SSA, df_landlock, df_fragile]
result = pd.concat(frames)

result

enter image description here


Solution

  • You can apply a function on the serie instead of isin()

    def get_value(x, y, choice):
        if x in y:
            return choice[0]
        else:
            return choice[1]
    
    indicators["LDC"] = indicators["Country"].apply(get_value, y=groupings["..."].tolist(), choice= ["Fragile", "Not Fragile"])
    

    I'm not 100% sure that you need tolist() but this code will apply the function for every lines of your dataframe and return either the choice 1 if True or 2 if False.

    I hope it helps,