Search code examples
pandasmergescatter-plot

How to (correctly) merge 2 Pandas DataFrames and scatter-plot


Thank you for your answers, in advance.

My end goal is to produce a scatter-plot - corruption as an explanatory variable (x axis, from a DataFrame 'corr') and inequality as a dependent variable (y axis, from a DataFrame 'inq'). A hint to produce an informative table (DataFrame) by joining these two Dataframes would be much appreciated I have a dataframe 'inq' for a country inequality (GINI index) and another one 'corr' for country corruption index.

pd.DataFrame(
    {
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1975": {0: nan, 1: nan, 2: nan},
        "1976": {0: nan, 1: nan, 2: nan},
        "2017": {0: nan, 1: 33.2, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
    }
)

pd.DataFrame(
    {
        "country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
        "1975": {0: 44.8, 1: 48.1, 2: 75.1},
        "1976": {0: 44.8, 1: 48.1, 2: 75.1},
        "2018": {0: 24.2, 1: 40.4, 2: 28.4},
        "2019": {0: 40.5, 1: 37.6, 2: 35.9},
    }
)

I concatenate and manipulate and get

cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
    level=1, drop=True
)

a new Dataframe

pd.DataFrame(
    {
        "indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1967": {0: nan, 1: nan, 2: nan},
        "1969": {0: nan, 1: nan, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
        "2019": {0: nan, 1: nan, 2: nan},
    }
)

Solution

  • You should concatenate your dataframe in a different way:

    df = (pd.concat([inq.set_index('country'),
                     corr.set_index('country')],
                     axis=1,
                     keys=["Inequality", "Corruption"]
                    )
            .stack(level=1)
         )
    
                      Inequality  Corruption
    country                                 
    Angola      1975         NaN        48.1
                1976         NaN        48.1
                2018        51.3        40.4
                2019         NaN        37.6
    Albania     1975         NaN        75.1
                1976         NaN        75.1
                2017        33.2         NaN
                2018         NaN        28.4
                2019         NaN        35.9
    Afghanistan 1975         NaN        44.8
                1976         NaN        44.8
                2018         NaN        24.2
                2019         NaN        40.5
    

    Then to plot:

    df.plot.scatter(x='Corruption', y='Inequality')
    

    enter image description here

    NB. there is only one point as most of your data is NaN