python pandas correlation

How to calculate corr from a dataframe with non-numeric columns


I am working with the Pokemon dataset: https://elitedatascience.com/wp-content/uploads/2022/07/Pokemon.csv

I want to plot a correlation heatmap with the following code:

# Calculate correlations
corr = stats_df.corr()
 
# Heatmap
plt.figure(figsize=(9,8))
sns.heatmap(corr)

But this code raises an error. How can I solve it?


Solution

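  • The (Pearson) correlation is only defined for numeric data, so select the numeric columns before calling corr(). The example below also drops the Total and Generation columns from the heatmap.

    Try: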

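    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv('Pokemon.csv', encoding='latin1', index_col='#')
    # Keep numeric columns only, then drop the ones not wanted in the heatmap
    corr = df.select_dtypes('number').drop(columns=['Total', 'Generation']).corr()
    sns.heatmap(data=corr)
    plt.tight_layout()
    plt.show()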
    

    Output:

    >>> corr
                   HP    Attack   Defense   Sp. Atk   Sp. Def     Speed
    HP       1.000000  0.422386  0.239622  0.362380  0.378718  0.175952
    Attack   0.422386  1.000000  0.438687  0.396362  0.263990  0.381240
    Defense  0.239622  0.438687  1.000000  0.223549  0.510747  0.015227
    Sp. Atk  0.362380  0.396362  0.223549  1.000000  0.506121  0.473018
    Sp. Def  0.378718  0.263990  0.510747  0.506121  1.000000  0.259133
    Speed    0.175952  0.381240  0.015227  0.473018  0.259133  1.000000
    

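As a minimal sketch of the same idea without the CSV file, the snippet below builds a small made-up frame (the names and values are hypothetical, not taken from the Pokemon data) and compares two ways of leaving out the non-numeric columns: select_dtypes('number'), as used above, and the numeric_only argument of corr(), which assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Pokemon data (values are made up)
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Type 1": ["Grass", "Fire", "Water", "Bug"],
    "HP": [45, 39, 44, 60],
    "Attack": [49, 52, 48, 80],
})

# Option 1: keep only the numeric columns, then correlate
corr_sel = df.select_dtypes("number").corr()

# Option 2: ask corr() to skip non-numeric columns (assumes pandas >= 1.5)
corr_num = df.corr(numeric_only=True)

print(corr_sel)
```

Both results contain only the HP and Attack columns here. The select_dtypes route makes the column choice explicit and is easy to combine with drop(columns=...), as in the answer above.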