Search code examples
pythondataframeinner-joinmerging-data

Merging 2 datasets in Python


I have 2 diffferent datasets and I want to merge these 2 datasets based on column "country" with the common country names and dropping the ones different. I have done it with inner merge, but the dataset is not as I want to have.

inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")

TFC_DATA (in the orginal dataset there exits a column called year but I've dropped it):

| Country | TFP |

| Angola | 0.8633379340171814 |

| Angola | 0.9345720410346984 |

| Angola | 1.0301895141601562 |

| Angola | 1.0850582122802734 |

.
.
.

CO2_DATA: | Country | year | GDP | co2

| Afghanistan | 2005 | 25397688320.0 | 1

| Afghanistan | 2006 | 28704401408.0 | 2

| Afghanistan | 2007 | 34507530240.0 | 2

| Afghanistan | 2008 | 1.0850582122802734 | 3

| Afghanistan | 2009 | 1.040212631225586 | 1

. . .

What I want is

Output |Country|Year|gdp|co2|TFP Angola|2005|51967275008.0|19.006|0.8633379340171814 Angola|2006|66748907520.0|19.006|0.9345720410346984 Angola|2007|87085293568.0|19.006|1.0301895141601562 . . .

What I have instead

Output Country|Year|gdp|co2|Year|TFP Angola|2005|51967275008.0|19.006|2005|0.8633379340171814 Angola|2005|51967275008.0|19.006|2006|0.9345720410346984 Angola|2005|51967275008.0|19.006|2007|1.0301895141601562 Angola|2005|51967275008.0|19.006|2008|1.0850582122802734 Angola|2005|51967275008.0|19.006|2009|1.040212631225586 Angola|2005|51967275008.0|19.006|2010|1.0594196319580078 Angola|2005|51967275008.0|19.006|2011|1.036203384399414 Angola|2005|51967275008.0|19.006|2012|1.076979637145996 Angola|2005|51967275008.0|19.006|2013|1.0862818956375122 Angola|2005|51967275008.0|19.006|2014|1.096832513809204 Angola|2005|51967275008.0|19.006|2015|1.0682281255722046 Angola|2005|51967275008.0|19.006|2016|1.0160540342330933 Angola|2005|51967275008.0|19.006|2017|1.0

I expected the datas of the countrys' merge in one dataset but it duplicates itself until the second one data is over then the second one does the same


Solution

  • pd.merge() function performs an inner join by default that means it only includes rows that have matching values in the specified columns.

    Use a different join type one option is to use a left outer join, which will include all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).

    Specify a left outer join using the how="left" parameter in the pd.merge() function.

    merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")

    After @abokey's comment EDIT

    First, create a new column in the TFC_DATA dataset with the year value

    TFC_DATA["year"] = TFC_DATA.index.year

    Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group

    TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()

    Reset the index to make "country" and "year" columns in the resulting dataset

    TFC_DATA_agg = TFC_DATA_agg.reset_index()

    Perform the inner merge, using "country" and "year" as the merge keys

    merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])