python, pandas, pandas-profiling

How to generate only the correlations and scatter plots using the Pandas Profiling package?


I am working with a large dataset and have been using the Pandas Profiling package. Because the dataset is large, the report takes too long to generate and browsers fail to open it. I therefore used the "minimal=True" option, which excludes the correlation matrices and the scatter plots. Is there any way to generate only the correlation matrices and scatter plots with Pandas Profiling?

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='EDA_Raw_Data', html={'style': {'full_width': True}}, minimal=True)
profile.to_file(output_file="EDA1_Raw_Data.html")

Solution

  • This is partially possible.

    To set the configuration of pandas-profiling to only present scatter plots (or hexbins) and correlation plots, you can start at the minimal configuration:

    https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml

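    Then edit a copy of that configuration: re-enable the correlations and scatter plots that the minimal configuration switches off, and disable any other computation you do not need (e.g. set the number of samples to zero).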

    from pandas_profiling import ProfileReport
    # In the v2.x API the path to the YAML file is passed via the config_file keyword
    profile = ProfileReport(df, config_file="your_config.yml")
    profile.to_file("EDA1_Raw_Data.html")
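
    As a rough sketch of how such a configuration file could be produced, the snippet below applies a set of overrides on top of the minimal configuration. The key names in OVERRIDES (correlations.*.calculate, interactions.continuous, samples.head/tail) are assumptions modelled on the v2.x config files and should be checked against the config_minimal.yaml linked above; deep_merge and write_config are hypothetical helpers, not part of pandas-profiling.

```python
import copy

# Hypothetical override set. The key names are assumptions based on the
# v2.x config layout; verify them against config_minimal.yaml.
OVERRIDES = {
    "correlations": {
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
    "interactions": {"continuous": True},  # scatter plots / hexbins
    "samples": {"head": 0, "tail": 0},     # skip the sample tables
}


def deep_merge(base, overrides):
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the override replaces (or adds) the entry.
            merged[key] = copy.deepcopy(value)
    return merged


def write_config(minimal_path, out_path, overrides=OVERRIDES):
    """Load the minimal YAML, apply the overrides, and save the result."""
    import yaml  # PyYAML, imported lazily so deep_merge works without it

    with open(minimal_path) as fh:
        base = yaml.safe_load(fh)
    with open(out_path, "w") as fh:
        yaml.safe_dump(deep_merge(base, overrides), fh)


if __name__ == "__main__":
    # Tiny stand-in for the minimal config, just to show the merge.
    minimal = {
        "correlations": {"pearson": {"calculate": False, "threshold": 0.9}},
        "interactions": {"continuous": False},
        "samples": {"head": 10, "tail": 10},
    }
    print(deep_merge(minimal, OVERRIDES))
```

    With a local copy of the minimal configuration, write_config("config_minimal.yaml", "your_config.yml") would produce the file that is then passed to ProfileReport. Settings not named in the overrides keep their minimal values.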
    

    Note that as of v2.6.0 it is not possible to disable every calculation. If you need that, please open a feature request at the repository.

    (Disclaimer: I am the author. The upcoming v2.7.0 includes significant performance improvements, which might also resolve your issue.)