Search code examples
matrixpysparkcorrelationpearson-correlation

How to get the correlation matrix of a pyspark data frame? NEW 2020


I have the same question from this topic:

How to get the correlation matrix of a pyspark data frame?

"I have a big pyspark data frame. I want to get its correlation matrix. I know how to get it with a pandas data frame.But my data is too big to convert to pandas. So I need to get the result with pyspark data frame.I searched other similar questions, the answers don't work for me. Can any body help me? Thanks!"

df4 is my dataset, he has 9 columns and all of them are integers:

reference__YM_unix:integer
tenure_band:integer
cei_global_band:integer
x_band:integer
y_band:integer
limit_band:integer
spend_band:integer
transactions_band:integer
spend_total:integer

I have first done this step:

# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df4.columns, outputCol=vector_col)
df_vector = assembler.transform(df4).select(vector_col)

# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)

And got the following output:

(matrix.collect()[0]["pearson({})".format(vector_col)].values)
Out[33]: array([ 1.        ,  0.0760092 ,  0.09051543,  0.07550633, -0.08058203,
       -0.24106848,  0.08229602, -0.02975856, -0.03108094,  0.0760092 ,
        1.        ,  0.14792512, -0.10744735,  0.29481762, -0.04490072,
       -0.27454922,  0.23242408,  0.32051685,  0.09051543,  0.14792512,
        1.        , -0.03708623,  0.13719527, -0.01135489,  0.08706559,
        0.24713638,  0.37453265,  0.07550633, -0.10744735, -0.03708623,
        1.        , -0.49640664,  0.01885793,  0.25877516, -0.05019079,
       -0.13878844, -0.08058203,  0.29481762,  0.13719527, -0.49640664,
        1.        ,  0.01080777, -0.42319841,  0.01229877,  0.16440178,
       -0.24106848, -0.04490072, -0.01135489,  0.01885793,  0.01080777,
        1.        ,  0.00523737,  0.01244241,  0.01811365,  0.08229602,
       -0.27454922,  0.08706559,  0.25877516, -0.42319841,  0.00523737,
        1.        ,  0.32888075,  0.21416322, -0.02975856,  0.23242408,
        0.24713638, -0.05019079,  0.01229877,  0.01244241,  0.32888075,
        1.        ,  0.53310864, -0.03108094,  0.32051685,  0.37453265,
       -0.13878844,  0.16440178,  0.01811365,  0.21416322,  0.53310864,
        1.        ])

I've tried to insert this result on arrays or an excel file but it didnt work. I did:

matrix2 = (matrix.collect()[0]["pearson({})".format(vector_col)])

Then I got the following error when I tried to display this info:

display(matrix2)

Exception: ML model display does not yet support model type <class 'pyspark.ml.linalg.DenseMatrix'>.

I was expecting to insert the name of the columns back from df4 but it didnt succeed, I've read that I need to use df4.columns but I have no idea how does it works.

Finally, I was expecting to print the following graph that I've seen from medium article

https://medium.com/towards-artificial-intelligence/feature-selection-and-dimensionality-reduction-using-covariance-matrix-plot-b4c7498abd07

But also it didn't work:

from sklearn.preprocessing import StandardScaler 
stdsc = StandardScaler() 
X_std = stdsc.fit_transform(df4.iloc[:,range(0,7)].values)
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 cmap='coolwarm',                 
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients', size = 18)
plt.tight_layout()
plt.show()


AttributeError: 'DataFrame' object has no attribute 'iloc'

I've tried to replace df4 to matrix2 and didn't work too


Solution

  • You can use the following to get the correlation matrix in a form you can manipulate:

    matrix = matrix.toArray().tolist() 
    

    From there you can convert to a dataframe pd.DataFrame(matrix) which would allow you to plot the heatmap, or save to excel etc.