In relation to my other question here, this code works if I use a small chunk of my dataset with dtype='int32'. Using float64 produces a TypeError further down in my main process because of safe casting rules, so I'll stick with int32 for now; nonetheless, I'm curious about the errors I'm getting.
import numpy as np

fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)
fp[:] = matrix[:]
If I use the full data (where shape=(329568, 27519)), I get a WindowsError when I use int32 or int, and an OverflowError when I use float64 (tracebacks below). Why do these happen, and how can I fix them?
Edit: Added Tracebacks
Traceback for int32
Traceback (most recent call last):
File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
WindowsError: [Error 8] Not enough storage is available to process this command
Traceback for float64
Traceback (most recent call last):
File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
fp = np.memmap("E:/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OverflowError: cannot fit 'long' into an index-sized integer
Edit: Added other info
Other info that might help: I have a 1TB (931GB usable) HDD with 2 partitions: Drive D (22.8GB free of 150GB), which holds my work files, including this script, and Drive E (406GB free of 781GB), which holds my torrent downloads. At first I tried to write the memmap file to Drive D, and it generated a 1,903,283KB file for int32 and a 3,806,566KB file for float64. I thought the errors might be caused by running out of space, so I tried Drive E, which should be more than enough, but it generated the same file sizes and gave the same errors.
I don't think it is possible to generate an np.memmap file that large using a 32 bit build of numpy, regardless of how much disk space you have.
The error occurs when np.memmap tries to call mmap.mmap internally. The second argument to mmap.mmap specifies the length of the file in bytes. For a 329568 by 27519 array containing 64 bit (8 byte) values, the length will be 72555054336 bytes (i.e. ~72GB).
The value 72555054336 needs to be converted to an integer type that can be used as an index. In 32 bit Python, indices need to be 32 bit integer values. However, the largest number that can be represented by a 32 bit integer is much smaller than 72555054336:
print(np.iinfo(np.int32(1)).max)
# 2147483647
Even an array of 32 bit (4 byte) values would require a length of 36277527168 bytes, which is still roughly 17x larger than the largest representable 32 bit integer.
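You can check the arithmetic directly:

rows, cols = 329568, 27519                          # len(documents), len(vocabulary)
print(rows * cols * np.dtype('int32').itemsize)     # 36277527168 bytes needed for int32
print(rows * cols * np.dtype('float64').itemsize)   # 72555054336 bytes needed for float64
# both are far larger than np.iinfo(np.int32).max == 2147483647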
I don't see any easy way around this problem besides switching to 64 bit Python/numpy. There are other very good reasons to do this - 32 bit Python can only address a maximum of 3GB of RAM, even though your machine has 8GB available.
Even if you could generate an np.memmap that big, the next line
matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)
will definitely fail, since it requires building an array in RAM that is roughly 36GB in size. The only way that you could possibly read that CSV file is in smaller chunks, like in my answer here that I linked to in the comments above.
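For what it's worth, here is a rough sketch of what chunked reading into the memmap could look like. It assumes 64 bit Python/numpy; the chunk size is a placeholder you would tune to your RAM, and the CSV path is copied from your snippet (point it at the full file):

import numpy as np
from itertools import islice

n_rows, n_cols = 329568, 27519        # len(documents), len(vocabulary)
chunk_rows = 1000                     # placeholder; tune so one chunk fits comfortably in RAM

fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(n_rows, n_cols))

with open("Results/TDM-short.csv") as f:
    next(f)                           # skip the header row
    for start in range(0, n_rows, chunk_rows):
        # genfromtxt will happily consume any iterable of lines
        chunk = np.atleast_2d(np.genfromtxt(islice(f, chunk_rows), dtype='int32', delimiter=','))
        fp[start:start + chunk.shape[0]] = chunk

fp.flush()

Each chunk only needs about chunk_rows * n_cols * 4 bytes of RAM (roughly 110MB for 1000 rows), and the operating system takes care of paging the memmap out to disk.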
As I mentioned in the comments for your other question, what you ought to do is convert your TermDocumentMatrix to a scipy.sparse matrix rather than writing it to a CSV file. This would require much, much less storage space and RAM, since it can take advantage of the fact that almost all of the word counts are zero-valued.
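As a rough illustration (the index and count arrays below are made up; in practice you would fill them from your TermDocumentMatrix), a sparse matrix only stores the non-zero counts:

import numpy as np
from scipy import sparse

n_docs, n_words = 329568, 27519

# hypothetical (document, word, count) triples for the non-zero entries only
doc_idx  = np.array([0, 0, 1, 2])
word_idx = np.array([5, 42, 7, 42])
counts   = np.array([3, 1, 2, 5], dtype=np.int32)

tdm = sparse.coo_matrix((counts, (doc_idx, word_idx)), shape=(n_docs, n_words)).tocsr()

# storage scales with the number of non-zero counts, not with n_docs * n_words
print(tdm.data.nbytes + tdm.indices.nbytes + tdm.indptr.nbytes)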