My objective is to plot a histogram given values and counts. hist only takes an array of data as input. I have tried to recreate the data using np.repeat, but this gives MemoryError: Unable to allocate 15.9 GiB for an array with shape (2138500000,) and data type float64.
Is there a smarter way of doing this?
import numpy as np
import matplotlib.pyplot as plt
values = [ 1, 2, 2.5, 4, 5, 5.75, 6.5]
counts = [10**8, 10**9, 1.5*10**7, 1.25*10**7, 10**6, 10**7,10**9]
data_recreated = np.repeat(values, counts)
f1, ax = plt.subplots(1,1)
ax.hist(data_recreated, bins=5)
As I mentioned in the comments, I am not sure what your use case for np.repeat is. Do you want to represent a dataset in which the value 1 appears 100 million times, the value 2 appears 1 billion times, and so on?
If so, then instead of creating the full array, you can get the same histogram by passing the counts as weights.
import numpy as np
import matplotlib.pyplot as plt
values = [1, 2, 2.5, 4, 5, 5.75, 6.5]
counts = [10**8, 10**9, 1.5*10**7, 1.25*10**7, 10**6, 10**7, 10**9]
plt.figure(figsize=(10, 6))
plt.hist(values, bins=5, weights=counts, edgecolor='black')
plt.title('Histogram of Values')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
# Add grid for better readability
plt.grid(True, alpha=0.3)
plt.show()
To verify:
counts_hist, bin_edges, _ = plt.hist(values, bins=5, weights=counts)
print("\nBin edges:", bin_edges)
print("Counts in each bin:", counts_hist)
Bin edges: [1. 2.1 3.2 4.3 5.4 6.5]
Counts in each bin: [1.10e+09 1.50e+07 1.25e+07 1.00e+06 1.01e+09]
That is:
1.10 billion
15 million
12.5 million
1 million
1.01 billion
As a result, when you use plt.hist(values, bins=5, weights=counts), it:
Places each value in its appropriate bin.
Adds the weight for that value instead of counting 1.
Creates bars with heights equal to the sum of the weights in each bin.
The result is identical to the histogram of the fully repeated array.
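The equivalence can be checked on a scaled-down example where np.repeat still fits in memory. This is only a sketch: small_counts is a hypothetical stand-in for the real counts, and np.histogram is used in place of plt.hist so that nothing has to be drawn (plt.hist accepts the same bins and weights arguments).

```python
import numpy as np

values = np.array([1, 2, 2.5, 4, 5, 5.75, 6.5])
# Hypothetical, scaled-down counts (not the real data) so the
# expanded array stays tiny
small_counts = np.array([100, 1000, 15, 12, 1, 10, 1000])

# Approach 1: materialise every observation, then bin
expanded = np.repeat(values, small_counts)
hist_repeat, edges_repeat = np.histogram(expanded, bins=5)

# Approach 2: bin the distinct values once, adding each value's
# count to its bin via weights
hist_weighted, edges_weighted = np.histogram(
    values, bins=5, weights=small_counts
)

print("repeat  :", hist_repeat)
print("weighted:", hist_weighted)
print("same bins and totals:",
      np.array_equal(hist_repeat, hist_weighted)
      and np.allclose(edges_repeat, edges_weighted))
```

Both approaches give bin totals of 1100, 15, 12, 1 and 1010 over the same bin edges, but the weighted version only ever stores the 7 distinct values, so its memory use does not grow with the counts.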