python pandas dataframe covariance

Calculating covariance matrix amongst different features using Pandas dataframe


I have a dataset in a pandas DataFrame with 9 features and 249 rows. I would like to compute the covariance matrix of the 9 features (a 9 x 9 matrix). However, when I call df.cov(), I only get a 3 x 3 matrix. What am I doing wrong here?

Thanks!

Below is my code snippet

# perform data preprocessing
# keep only players with an MPG of at least 20, then select only the required columns
MPG_df = df.loc[df['MPG'] >= 20]
processed_df = MPG_df[["FT%", "2P%", "3P%", "PPG", "RPG", "APG", "SPG", "BPG", "TOPG"]]
processed_df


When I try to compute the covariance matrix with the code below, I only get a 3 x 3 matrix:

# desired result: a 9 x 9 covariance matrix
cov_processed_df = pandas.DataFrame(processed_df, columns=['FT%', '2P%', '3P%', 'PPG', 'RPG', 'APG', 'SPG', 'BPG', 'TOPG']).cov()
cov_processed_df




Solution

  • The excluded columns are probably non-numeric, even though they look numeric, which is why cov() leaves them out. Try:

    cov_processed_df = processed_df.astype(float).cov()
    

    To check the data types of processed_df, you can run:

    print(processed_df.dtypes)
    

    If "object" appears in the output, those columns are non-numeric. A single non-numeric value is enough for pandas to store the whole column as object.
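
One caveat with astype(float): it raises a ValueError if any value cannot be parsed as a number (for example a "-" placeholder). Below is a minimal sketch of a more forgiving alternative using pd.to_numeric with errors="coerce", which turns unparseable values into NaN. The data and the names sample_df and numeric_df are made up for illustration and are not taken from the question's dataset.

```python
import pandas as pd

# Hypothetical data: "FT%" and "3P%" were read in as strings,
# so pandas does not treat them as numeric columns.
sample_df = pd.DataFrame({
    "FT%": ["0.81", "0.75", "0.90", "0.68"],
    "3P%": ["0.35", "-", "0.41", "0.38"],  # "-" is a non-numeric placeholder
    "PPG": [12.4, 8.9, 20.1, 15.3],
})

# The string columns show up as object (or a string dtype in newer pandas versions).
print(sample_df.dtypes)

# Convert every column; values that cannot be parsed become NaN instead of raising.
numeric_df = sample_df.apply(pd.to_numeric, errors="coerce")

# All columns are numeric now, so every one of them appears in the result.
cov_matrix = numeric_df.cov()
print(cov_matrix.shape)
```

cov() ignores NaN values pairwise, so a handful of coerced entries does not drop an entire column. It is still worth counting them (numeric_df.isna().sum()) to see how much data was unparseable.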