Search code examples
pythonpandaseconomics

How to combine and do group computations on Pandas datasets?


I'm working on an economics paper and need some help with combining and transforming two datasets.

I have two pandas dataframes, one with a list of countries and their neighbors (borderdf) such as

borderdf
country    neighbor
sweden     norway
sweden     denmark
denmark    germany
denmark    sweden

and one with data (datadf) for each country and year such as

datadf
country    gdp    year
sweden     5454   2004
sweden     5676   2005
norway     3433   2004
norway     3433   2005
denmark    2132   2004
denmark    2342   2005

I need to create a column in the datadf for neighbormeangdp that would contain the mean of the gdp of all the neighbors, as given by neighbordf. I would like my result to look like this:

datadf
country    year    gdp    neighborsmeangdp
sweden     2004    5454   5565
sweden     2005    5676   5775

How should I go about doing this?


Solution

  • You can directly merge the two using pandas merge function. The trick here is that you actually want to merge the country column in your datadf with the neighbor column in your borderdf. Then use groupby and mean to get the average neighbor gdp. Finally, merge back with the data to get the country's own GDP. For example:

    import pandas as pd
    from StringIO import StringIO
    
    border_csv = '''
    country, neighbor
    sweden, norway
    sweden, denmark
    denmark, germany
    denmark, sweden
    '''
    
    data_csv = '''
    country, gdp, year
    sweden, 5454, 2004
    sweden, 5676, 2005
    norway, 3433, 2004
    norway, 3433, 2005
    denmark, 2132, 2004
    denmark, 2342, 2005
    '''
    
    borders = pd.read_csv(StringIO(border_csv), sep=',\s*', header=1)
    data = pd.read_csv(StringIO(data_csv), sep=',\s*', header=1)
    
    merged = pd.merge(borders,data,left_on='neighbor',right_on='country')
    merged = merged.drop('country_y', axis=1)
    merged.columns = ['country','neighbor','gdp','year']
    
    
    grouped = merged.groupby(['country','year'])
    neighbor_means = grouped.mean()
    neighbor_means.columns = ['neighbor_gdp']
    neighbor_means.reset_index(inplace=True)
    
    results_df = pd.merge(neighbor_means,data, on=['country','year'])