Tags: python, numpy, matplotlib

MemoryError in creating large numpy array


My objective is to plot a histogram given values and counts. hist only takes an array of data as input. I have tried to recreate the data using np.repeat, but this gives MemoryError: Unable to allocate 15.9 GiB for an array with shape (2138500000,) and data type float64.

Is there a smarter way of doing this?

import numpy as np 
import matplotlib.pyplot as plt 

values = [ 1, 2, 2.5, 4, 5, 5.75, 6.5]
counts = [10**8, 10**9, 1.5*10**7, 1.25*10**7, 10**6, 10**7,10**9]

data_recreated = np.repeat(values, counts)

f1, ax = plt.subplots(1,1)

ax.hist(data_recreated, bins=5)
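
For reference, the size in the error message follows directly from the counts: np.repeat would need one float64 (8 bytes) per repeated element. A minimal sketch of that arithmetic (the variable names are only illustrative):

```python
import numpy as np

counts = [10**8, 10**9, 1.5*10**7, 1.25*10**7, 10**6, 10**7, 10**9]

# Total number of elements np.repeat would have to allocate
n_elements = int(sum(counts))

# Each float64 element occupies 8 bytes
bytes_needed = n_elements * np.dtype(np.float64).itemsize

# Convert bytes to GiB (1 GiB = 2**30 bytes)
gib_needed = bytes_needed / 2**30

print(n_elements)            # 2138500000
print(round(gib_needed, 1))  # 15.9
```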

Solution

  • As I mentioned in the comments, I'm not sure what your use case for repeat is. Do you want to represent a dataset in which the value 1 appears 100 million times, the value 2 appears 1 billion times, and so on?

    If so, then instead of creating the full array you can get the same histogram by passing the counts as weights.

    import numpy as np
    import matplotlib.pyplot as plt

    # Distinct values and how often each one occurs
    values = [1, 2, 2.5, 4, 5, 5.75, 6.5]
    counts = [10**8, 10**9, 1.5*10**7, 1.25*10**7, 10**6, 10**7, 10**9]

    plt.figure(figsize=(10, 6))

    # Each value contributes its count to its bin instead of 1
    plt.hist(values, bins=5, weights=counts, edgecolor='black')

    plt.title('Histogram of Values')
    plt.xlabel('Values')
    plt.ylabel('Frequency')

    # Show the y-axis in scientific notation
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))

    # Add grid for better readability
    plt.grid(True, alpha=0.3)

    plt.show()


    This produces the histogram (image not reproduced here).

    To verify the bin contents:

    counts_hist, bin_edges, _ = plt.hist(values, bins=5, weights=counts)
    print("\nBin edges:", bin_edges)
    print("Counts in each bin:", counts_hist)
    
    Bin edges: [1.  2.1 3.2 4.3 5.4 6.5]
    Counts in each bin: [1.10e+09 1.50e+07 1.25e+07 1.00e+06 1.01e+09]
    

    That is, per bin:

    1.10 billion
    15 million
    12.5 million
    1 million
    1.01 billion
    

    In summary, when you use plt.hist(values, bins=5, weights=counts), it:

      • places each value in its appropriate bin;
      • adds the weight for that value to the bin instead of counting 1;
      • creates bars with heights equal to the sum of the weights in each bin.

    The result is identical to the histogram of the fully repeated array, without ever allocating that array.
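
If you want to convince yourself that the two approaches agree, here is a small sanity check. It is only a sketch: the counts are scaled down by a factor of 10**5 (my own choice, so that np.repeat fits comfortably in memory), and it uses np.histogram rather than plt.hist so that nothing is drawn.

```python
import numpy as np

values = np.array([1, 2, 2.5, 4, 5, 5.75, 6.5])

# Original counts divided by 10**5 (an assumption made only for this check)
small_counts = np.array([1000, 10000, 150, 125, 10, 100, 10000])

# Approach 1: materialize the repeated data, then bin it
expanded = np.repeat(values, small_counts)
hist_expanded, edges_expanded = np.histogram(expanded, bins=5)

# Approach 2: bin the distinct values, weighting each by its count
hist_weighted, edges_weighted = np.histogram(values, bins=5, weights=small_counts)

# Both approaches use the same bin edges and give the same bin totals
print(np.allclose(edges_expanded, edges_weighted))
print(np.array_equal(hist_expanded, hist_weighted))
print(hist_weighted)
```

Both checks should print True, and the bin totals are the original ones divided by 10**5.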