I am trying to work with PyTables and NumPy.
Can you please tell me how much data the latter can handle?
I am currently working with 140 million rows of data and would like to know whether NumPy can handle that. It would be nice if it could handle at least 140 million rows of 2 columns. Right now I use a 64-bit version of Windows with 8 GB of RAM.
If NumPy cannot handle this amount of data, what are the possible alternatives for implementing statistics and machine learning algorithms?
140M is much less than 2**31, so this should fit even in a 32-bit Python/NumPy, given sufficient memory. You can easily try this out with
>>> import numpy as np
>>> X = np.empty((140_000_000, 2))  # shape entries must be integers, not the float 140e6
The memory use with the default dtype=np.float64 is 8 bytes × 140M × 2 ≈ 2.2 GB. If you use dtype=np.float32 instead, you halve that to about 1.1 GB.
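To check the arithmetic without allocating anything, here is a minimal sketch that computes the footprint of a dense array from its shape and dtype. The helper name `estimated_bytes` is made up for illustration and is not part of NumPy; only `np.dtype(...).itemsize` is a NumPy call.

```python
import numpy as np

def estimated_bytes(n_rows, n_cols, dtype):
    """Size in bytes of a dense (n_rows, n_cols) array of the given dtype.

    Hypothetical helper for illustration; it only multiplies the element
    count by the per-element size and allocates nothing.
    """
    return n_rows * n_cols * np.dtype(dtype).itemsize

n_rows, n_cols = 140_000_000, 2

size64 = estimated_bytes(n_rows, n_cols, np.float64)  # 8 bytes per element
size32 = estimated_bytes(n_rows, n_cols, np.float32)  # 4 bytes per element

print(f"float64: {size64 / 1e9:.2f} GB")
print(f"float32: {size32 / 1e9:.2f} GB")
```

For an array that already exists, `X.nbytes` reports the same number directly. Keep in mind that operations which create temporaries (for example `X * 2`) need a second buffer of the same size, so on an 8 GB machine the float64 array leaves room for only a couple of full-size copies.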