I'm working on an economics paper and need some help with combining and transforming two datasets.
I have two pandas dataframes, one with a list of countries and their neighbors (borderdf) such as
borderdf
country neighbor
sweden norway
sweden denmark
denmark germany
denmark sweden
and one with data (datadf) for each country and year such as
datadf
country gdp year
sweden 5454 2004
sweden 5676 2005
norway 3433 2004
norway 3433 2005
denmark 2132 2004
denmark 2342 2005
I need to create a column in the datadf for neighbormeangdp that would contain the mean of the gdp of all the neighbors, as given by neighbordf. I would like my result to look like this:
datadf
country year gdp neighborsmeangdp
sweden 2004 5454 5565
sweden 2005 5676 5775
How should I go about doing this?
You can directly merge the two using pandas merge
function.
The trick here is that you actually want to merge the country column in your datadf
with the neighbor column in your borderdf
.
Then use groupby
and mean
to get the average neighbor gdp.
Finally, merge back with the data to get the country's own GDP.
For example:
import pandas as pd
from StringIO import StringIO
border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''
data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''
borders = pd.read_csv(StringIO(border_csv), sep=',\s*', header=1)
data = pd.read_csv(StringIO(data_csv), sep=',\s*', header=1)
merged = pd.merge(borders,data,left_on='neighbor',right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country','neighbor','gdp','year']
grouped = merged.groupby(['country','year'])
neighbor_means = grouped.mean()
neighbor_means.columns = ['neighbor_gdp']
neighbor_means.reset_index(inplace=True)
results_df = pd.merge(neighbor_means,data, on=['country','year'])