I am new to python and data science, and I am currently working on a project that is based on a very large dataframe, with 75 columns. I am doing some data exploration and I would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas plotting.scatter_matrix() on the dataframe in order to do so. However, in my case this produces a 75x75 matrix -- and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd
df = pd.read_csv('dataset.csv')
list1 = [df.iloc[:, i] for i in range(5)]
list2 = [df.iloc[:, i+5] for i in range(5)]
list3 = [df.iloc[:, i+10] for i in range(5)]
pd.plotting.scatter_matrix(df_acoes[list1])
pd.plotting.scatter_matrix(df_acoes[list2])
pd.plotting.scatter_matrix(df_acoes[list3])
In order to use this same method with 75 columns, I'd have to go on until list15
. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is to a lesser extend the technical part. The production of the plots (in number 5625) will take quite a long time. Additionally, the plots will take a bit of memory.
So I would ask a few questions to get around the problems:
If the plots are really useful, You could produce them by your own and stick them together, or wait until the function is ready.