python performance matplotlib histogram bins

python very long compilation time


I am using the plt.hist() function to show a histogram. When I tried it on a smaller dataset, everything worked fine. However, my original dataset contains nearly 30k samples, and I need to show 6 values per sample on that histogram. I am aware this is a lot, but what I need help with is how to reduce the compilation time in my case. I am okay with waiting 10 minutes, but yesterday I waited over an hour for the result and gave up.

How can I optimize it and reduce the compilation time? My first idea was adding bins to that function, so something like this:

plt.hist(values, bins=50)

But I am not sure what exactly bins do. Will this result in printing the histogram too general for my data or will it just take 50 first values from my data? Besides, will it shorten the compilation time? What can I do?


Solution

  • But I am not sure what exactly bins do. Will this result in printing the histogram too general for my data or will it just take 50 first values from my data?

    You can think of bins as a partition of your x-axis. The higher the number of bins, the finer the resolution of your histogram (and the noisier it tends to look); fewer bins give a coarser, smoother picture.

    Having 50 bins means that the range of values in the data you're plotting is subdivided into 50 equal-width intervals, and each bin holds the count of the elements whose value falls inside that interval. All of your data is counted; bins=50 does not mean that only the first 50 values are used.

    Let's say you want to make a histogram of integer elements with values from 0 to 99, and you use 10 bins. The first bin will count the elements whose value is 0 <= elem_val <= 9, the second bin the elements whose value is 10 <= elem_val <= 19, and so on.

    So if you add more bins, each range gets narrower and contains fewer elements, but the histogram shows more detail.
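The binning rule described above can be sketched in plain Python. This is only an illustration of equal-width binning, not matplotlib's actual implementation; the helper name `equal_width_histogram` is made up for this example. Like `plt.hist`, it treats every bin as half-open, except the last one, which also includes the maximum value.

```python
def equal_width_histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min(values)..max(values).

    Hypothetical helper for illustration; assumes at least two distinct values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # bins + 1 edges delimit `bins` intervals
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # Each bin is [left, right); the last bin also includes the maximum
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


values = list(range(100))  # the integers 0..99 from the example above

counts, edges = equal_width_histogram(values, bins=10)
print(counts)  # 10 elements in each of the 10 bins

fine_counts, fine_edges = equal_width_histogram(values, bins=50)
print(fine_counts)  # narrower bins, so fewer elements per bin

# Every value is counted exactly once, whatever the number of bins
print(sum(counts), sum(fine_counts), len(values))
```

Note that `bins` only changes how the x-axis is divided. The total count stays equal to the number of values, so no data is dropped.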

  • Besides, will it shorten the compilation time? What can I do?

    This answer looks good to me: https://stackoverflow.com/a/39582304/11153525
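Separately from the linked answer, one approach that often keeps plotting time down is to flatten the data into a single NumPy array, compute the counts once with `numpy.histogram`, and draw only the resulting bars. The sketch below is a guess at the setup, since the question does not show how `values` is built: the `(30_000, 6)` shape and the random data are placeholders, and the real cause of the slowdown may lie elsewhere.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data: 30k samples with 6 values each (assumed shape, random values)
rng = np.random.default_rng(0)
data = rng.normal(size=(30_000, 6))

# Flatten to 1-D so all 180k values go into one histogram
# instead of being treated as many separate datasets
flat = data.ravel()

# Compute the counts once; `edges` has one more entry than `counts`
counts, edges = np.histogram(flat, bins=50)

# Draw 50 bars from the precomputed counts
fig, ax = plt.subplots()
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
ax.set_xlabel("value")
ax.set_ylabel("count")
```

If a 2-D array or nested list is passed to `plt.hist` directly, matplotlib treats it as multiple datasets, so it is worth checking the shape of `values` before plotting.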