In relation to my other question here, this code works if I use a small chunk of my dataset with dtype='int32'. Using float64 produces a TypeError further down in my main process because of safe casting rules, so I'll stick with int32 for now; nonetheless, I'm curious about the errors I'm getting.
import numpy as np

fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)
fp[:] = matrix[:]
If I use the full data (where shape=(329568, 27519)), I get a WindowsError when I use int32 or int, and an OverflowError when I use float64 (tracebacks below). Why do these happen, and how can I fix them?
Edit: Added Tracebacks
Traceback for int32
Traceback (most recent call last):
File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
WindowsError: [Error 8] Not enough storage is available to process this command
Traceback for float64
Traceback (most recent call last):
File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
fp = np.memmap("E:/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OverflowError: cannot fit 'long' into an index-sized integer
Edit: Added other info
Other info that might help: I have a 1TB (931GB usable) HDD with 2 partitions: Drive D (22.8GB free of 150GB), which holds my work files, including this script, and Drive E (406GB free of 781GB), which holds my torrent downloads. At first I tried to write the memmap file to Drive D, and it generated a 1,903,283KB file for int32 and a 3,806,566KB file for float64. I thought the errors might be caused by running out of space, so I tried Drive E, which should be more than enough, but it generated the same file sizes and gave the same errors.
I don't think it is possible to generate an np.memmap file that large using a 32 bit build of numpy, regardless of how much disk space you have.
The error occurs when np.memmap tries to call mmap.mmap internally. The second argument to mmap.mmap specifies the length of the file in bytes. For a 329568 by 27519 array containing 64 bit (8 byte) values, the length will be 72555054336 bytes (i.e. ~72GB).
The value 72555054336 needs to be converted to an integer type that can be used as an index. In 32 bit Python, indices need to be 32 bit integer values. However, the largest number that can be represented by a 32 bit integer is much smaller than 72555054336:
print(np.iinfo(np.int32(1)).max)
# 2147483647
Even an array of 32 bit (4 byte) values would require a length of 36277527168 bytes, which is still roughly 17x larger than the largest representable 32 bit integer.
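You can check the arithmetic directly:

rows, cols = 329568, 27519                          # len(documents), len(vocabulary)
print(rows * cols * np.dtype('int32').itemsize)     # 36277527168 bytes needed for int32
print(rows * cols * np.dtype('float64').itemsize)   # 72555054336 bytes needed for float64
# both are far larger than np.iinfo(np.int32).max == 2147483647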
I don't see any easy way around this problem besides switching to 64 bit Python/numpy. There are other very good reasons to do this - 32 bit Python can only address a maximum of 3GB of RAM, even though your machine has 8GB available.
Even if you could generate an np.memmap that big, the next line
matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)
will definitely fail, since it requires building an array in RAM that is roughly 36GB in size. The only way that you could possibly read that CSV file is in smaller chunks, like in my answer here that I linked to in the comments above.
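For what it's worth, here is a rough sketch of what chunked reading into the memmap could look like. It assumes 64 bit Python/numpy; the chunk size is a placeholder you would tune to your RAM, and the CSV path is copied from your snippet (point it at the full file):

import numpy as np
from itertools import islice

n_rows, n_cols = 329568, 27519        # len(documents), len(vocabulary)
chunk_rows = 1000                     # placeholder; tune so one chunk fits comfortably in RAM

fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(n_rows, n_cols))

with open("Results/TDM-short.csv") as f:
    next(f)                           # skip the header row
    for start in range(0, n_rows, chunk_rows):
        # genfromtxt will happily consume any iterable of lines
        chunk = np.atleast_2d(np.genfromtxt(islice(f, chunk_rows), dtype='int32', delimiter=','))
        fp[start:start + chunk.shape[0]] = chunk

fp.flush()

Each chunk only needs about chunk_rows * n_cols * 4 bytes of RAM (roughly 110MB for 1000 rows), and the operating system takes care of paging the memmap out to disk.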
As I mentioned in the comments for your other question, what you ought to do is convert your TermDocumentMatrix to a scipy.sparse matrix rather than writing it to a CSV file. This would require much, much less storage space and RAM, since it can take advantage of the fact that almost all of the word counts are zero-valued.
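As a rough illustration (the index and count arrays below are made up; in practice you would fill them from your TermDocumentMatrix), a sparse matrix only stores the non-zero counts:

import numpy as np
from scipy import sparse

n_docs, n_words = 329568, 27519

# hypothetical (document, word, count) triples for the non-zero entries only
doc_idx  = np.array([0, 0, 1, 2])
word_idx = np.array([5, 42, 7, 42])
counts   = np.array([3, 1, 2, 5], dtype=np.int32)

tdm = sparse.coo_matrix((counts, (doc_idx, word_idx)), shape=(n_docs, n_words)).tocsr()

# storage scales with the number of non-zero counts, not with n_docs * n_words
print(tdm.data.nbytes + tdm.indices.nbytes + tdm.indptr.nbytes)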