Tags: numpy, pytables

How much data can NumPy handle?


I am trying to work with PyTables and NumPy.

Can you please tell me how much data the latter can handle?

I am currently handling data with 140 million rows and would like to know whether NumPy can cope with it. Ideally, it should handle at least 140 million rows with 2 columns. Right now I use a 64-bit version of Windows with 8 GB of RAM.

If NumPy cannot handle this amount of data, what are the possible alternatives for implementing statistics and machine learning algorithms?


Solution

  • 140M rows is much less than 2**31, NumPy's index limit on 32-bit builds, so this should fit even in a 32-bit Python/NumPy given sufficient memory. You can easily try this out with

    >>> import numpy as np
    >>> X = np.empty((int(140e6), 2))  # shape dimensions must be integers
    

    The memory use with the default dtype=np.float64 is 8 bytes × 140M × 2 ≈ 2.2 GB. If you use dtype=np.float32 you can save a factor of 2.
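
    If you want to verify the footprint yourself, a minimal sketch (the row/column counts simply mirror the numbers above) is to compare the two dtypes via the array's nbytes attribute:

    >>> import numpy as np
    >>> rows, cols = 140_000_000, 2
    >>> np.empty((rows, cols), dtype=np.float64).nbytes / 1e9  # size in GB at 8 bytes/element
    2.24
    >>> np.empty((rows, cols), dtype=np.float32).nbytes / 1e9  # half the footprint, reduced precision
    1.12

    Even allocating both arrays at once needs about 3.4 GB, which still fits comfortably in 8 GB of RAM.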