I am trying to read a CSV file I previously created in Python using

    import csv

    with open(csvname, 'w') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',')
        csvwriter.writerows(data)
Here, data is a random matrix with about 30k * 30k entries in np.float32 format, roughly 10 GB of file size in total.
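Just to make the setup concrete, the matrix can be produced with something along these lines; data and csvname mirror the names used above, the file name is a placeholder, and the random values are only a stand-in for my actual data:

    import numpy as np

    # Stand-in for the actual data: a 30k x 30k random float32 matrix.
    # Generating float32 directly avoids a temporary float64 array
    # (~3.6 GB in memory instead of ~7.2 GB).
    rng = np.random.default_rng()
    data = rng.random((30000, 30000), dtype=np.float32)
    csvname = 'data.csv'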
I read the file back in using the following function (I know the size of my matrix in advance, and np.genfromtxt is incredibly slow and would need about 100 GB of RAM at this point):
    import time
    import numpy as np

    def read_large_txt(path, delimiter=',', dtype=np.float32, nrows=0):
        t1 = time.time()
        with open(path, 'r') as f:
            # Preallocate the full matrix since its size is known in advance
            out = np.empty((nrows, nrows), dtype=dtype)
            for ii, line in enumerate(f):
                # The file has a blank line after every data row,
                # so only every second line is parsed
                if ii % 2 == 0:
                    out[ii // 2] = line.split(delimiter)
        print('Reading %s took %.3f s' % (path, time.time() - t1))
        return out
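Called roughly like this (the file name is the placeholder from above; nrows matches the 30k rows of the matrix):

    out = read_large_txt('data.csv', delimiter=',', dtype=np.float32, nrows=30000)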
Reading the file this way takes me about 10 minutes. The hard drive I am using should be able to read at about 100 MB/s, so the raw I/O for a ~10 GB file should only take about 10 GB / 100 MB/s ≈ 100 s, i.e. roughly 1-2 minutes.
Any ideas what I may be doing wrong?
Related: why numpy narray read from file consumes so much memory? That's where the function read_large_txt is from.
I found a quite simple solution. Since I am creating the files myself anyway, I don't need to save them as .csv files. It is way (!) faster to load them as .npy files:
Loading (incl. splitting each line by ',') a 30k * 30k matrix stored as .csv takes about 10 minutes. Doing the same with a matrix stored as .npy takes about 10 seconds!
That's why I have to change the code I wrote above to:

    np.save(npyname, data)

and in the other script to:

    out = np.load(npyname + '.npy')
Another advantage of this method: in my case, the .npy files are only about 40% of the size of the .csv files. :)
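Putting it together, a minimal sketch of the save/load round trip looks like this; npyname is a placeholder, data is the matrix from the question, and the actual timings will of course depend on the hardware:

    import time
    import numpy as np

    npyname = 'data'  # np.save appends the '.npy' extension automatically

    # Save the matrix in NumPy's binary format
    t1 = time.time()
    np.save(npyname, data)
    print('Saving took %.3f s' % (time.time() - t1))

    # Load it back (this is what the other script does)
    t1 = time.time()
    out = np.load(npyname + '.npy')
    print('Loading took %.3f s' % (time.time() - t1))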