Tags: python, python-3.x, matplotlib, seaborn, scaling

Plotting a column with millions of rows


I have a DataFrame with millions of rows (almost 8 million). I need to see the distribution of the values in one of its columns, 'price_per_mile'. I also have a column called 'Borough'. The final goal is a t-test. First I want to look at the distribution of 'price_per_mile' to check whether the data is normal and whether it needs cleaning. Then I want to group by the five categories in the 'Borough' column and run the t-test for each possible pair of boroughs.

I tried to plot the distribution with sns.distplot(), but the plot is not clear: the values on the y-axis appear to be scaled rather than shown as counts. The range of values in 'price_per_mile' is also very wide.


Then I tried plotting only a subset of the values (dropping rows below 1 or above 200), but the plot is still not clear or informative enough, and the y-axis is scaled again.

result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
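A non-mutating variant of the filter above can be sketched as follows. It assumes `result` is a pandas DataFrame with a numeric `price_per_mile` column; the small frame below is made-up illustration data, and `mask` and `filtered` are hypothetical names. `Series.between` is inclusive on both ends by default, so it keeps the same 1 to 200 range, although unlike the `drop` call it also excludes NaN values. `np.histogram` then returns the raw count per bin without drawing anything.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real 8-million-row frame (illustration only).
result = pd.DataFrame(
    {"price_per_mile": [0.5, 1.0, 3.2, 7.8, 150.0, 200.0, 950.0]}
)

# Keep rows with 1 <= price_per_mile <= 200 without modifying `result` in place.
mask = result["price_per_mile"].between(1, 200)
filtered = result.loc[mask]

# Raw (non-normalized) counts per bin, computed without any plotting.
counts, edges = np.histogram(filtered["price_per_mile"], bins=4, range=(1, 200))
print(len(filtered), counts.tolist())
```

Keeping the original frame untouched makes it easy to try several cut-offs before deciding how to clean the data.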

What do I need to do to get a clearer plot that shows the true count in each bin rather than a normalized value? I read the documentation for sns.distplot() but didn't find anything helpful.


Solution

  • As per the documentation for distplot (emphasis mine)

    norm_hist : bool, optional

    If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.

    This means that if you want a non-normalized histogram, you have to tell seaborn not to plot the KDE at the same time. Since kde defaults to True in distplot, it has to be switched off explicitly. Compare the two calls:

    # KDE drawn: the histogram is normalized to a density even though norm_hist=False
    sns.distplot(a, kde=True, norm_hist=False)
    


    # No KDE: the histogram height is the raw count in each bin
    sns.distplot(a, kde=False, norm_hist=False)
    
