Search code examples
pythonpandasdata-cleaning

Extract values from dataset to perform functions- multiple countries within dataset


My dataset looks as follows:

Country Year Value
Ireland 2010 9
Ireland 2011 11
Ireland 2012 14
Ireland 2013 17
Ireland 2014 20
France 2011 15
France 2012 19
France 2013 21
France 2014 28
Germany 2008 17
Germany 2009 20
Germany 2010 19
Germany 2011 24
Germany 2012 27
Germany 2013 32

My goal is to create a new dataset which tells me the % increase from the first year of available data for a given country, compared to the most recent, which would look roughly as follows:

Country % increase
Ireland 122
France 87
Germany 88

In essence, I need my code for each country in my dataset, to locate the smallest and largest value for year, then take the corresponding values within the value column and calculate the % increase.

I can do this manually, however I have a lot of countries in my dataset and am looking for a more elegant way to do it. I am trying to troubleshoot my code for this however I am not having much luck as of yet.

My code looks as follows at present:

df_1["Min_value"] = df.loc[df["Year"].min(),"Value"].iloc[0]
df_1["Max_value"] = df.loc[df["Year"].max(),"Value"].iloc[0]

df_1["% increase"] = ((df_1["Max_value"]-df_1["Min_value"])/df_1["Min_value"])*100

This returns an error:

AttributeError: 'numpy.float64' object has no attribute 'iloc'

In addition to this it also has the issue that I cannot figure out a way to have the code to run individually for each country within my dataset, so this is another challenge which I am not entirely sure how to address.

Could I potentially go down the route of defining a particular function which could then be applied to each country?


Solution

  • You can group by Country and aggregate min/max for both Year and Value, then calculate percentage change between min and max of the Value.

    pct_df = df.groupby(['Country']).agg(['min', 'max'])['Value']\
        .apply(lambda x: x.pct_change().round(2) * 100, axis=1)\
        .drop('min', axis=1).rename(columns={'max':'% increase'}).reset_index()
    
    print(pct_df)
    

    The output:

        Country  % increase
    0   France         87.0
    1  Germany         88.0
    2  Ireland        122.0