I have on the order of 10^5 binary files which I read one by one in a for loop with numpy's fromfile and plot with pyplot's imshow. Each file takes about a minute to read and plot.
Is there a way to speed things up?
Here is some pseudo code to explain my situation:
#!/usr/bin/env python
import numpy as np
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
nx = 1200 ; ny = 1200
fig, ax = plt.subplots()
ax.set_xlabel('x')
ax.set_ylabel('y')
for f in files:
    data = np.fromfile(f, dtype=np.float32, count=nx*ny)
    data.resize(nx, ny)
    im = ax.imshow(data)
    fig.savefig(f + '.png', dpi=300, bbox_inches='tight')
    im.remove()
I found the last step (im.remove()) to be crucial; without it, memory usage explodes.
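A variation that may avoid the create/remove cycle altogether (a sketch, untested here, assuming all frames have the same shape): create the AxesImage once and update its contents with set_data on every iteration:
im = ax.imshow(np.zeros((nx, ny), dtype=np.float32))
for f in files:
    data = np.fromfile(f, dtype=np.float32, count=nx*ny)
    data.resize(nx, ny)
    im.set_data(data)
    im.autoscale()  # rescale the color limits to the new frame
    fig.savefig(f + '.png', dpi=300, bbox_inches='tight')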
Strangely, after a reboot (a solution I don't usually resort to), read time is down to ~0.002 seconds per file on average, and render time is ~0.02 seconds. Saving the .png file takes ~2.6 seconds, so all in all, each frame takes about 2.7 seconds.
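Since saving the .png now dominates, one possible shortcut (a suggestion I have not benchmarked here) is to give up the axes and labels and write the array straight to an image with pyplot's imsave, which bypasses the figure-rendering machinery entirely:
import matplotlib.pyplot as plt
plt.imsave(f + '.png', data)  # writes data as a bare image; no axes or labels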
I took @DrV's advice,
...I would just start four (assuming there are four cores) copies of the script, each copy having access to a different 2.5 x 10^4 set of images. With a SSD disk this should not cause I/O seek catastrophes.
and partitioned the file list into 8 sublists, then ran 8 instances of my script.
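A minimal sketch of that partitioning (passing the instance index on the command line is an assumption for illustration; any way of handing each instance its share works):
import sys
nproc = 8
i = int(sys.argv[1])        # instance index, 0..nproc-1
my_files = files[i::nproc]  # every nproc-th file, offset by i
for f in my_files:
    pass  # read, plot and save as in the loop above
The same effect could be had within a single script using multiprocessing.Pool.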
@DrV's comment
Also, your 0.002 s read time for a 5.7 MB file read does not sound realistic if the file is not in the RAM cache, as it would indicate disk read speed of 2.8 GB/s. (Fast SSDs may just reach 500 MB/s.)
made me benchmark the read/write speeds on my laptop (MacBookPro10,1). I used the following code to produce 1000 files of 1200*1200 random floats (4 bytes each), so that each file is 5.8 MB (1200*1200*4 = 5,760,000 bytes), and then read them back one by one, timing the process. The code is run from the terminal and never takes up more than 50 MB of memory (quite a lot for holding only one 5.8 MB data array in memory, no?).
The code:
#!/usr/bin/env ipython
import os
from time import time
import numpy as np
temp = 'temp'
if not os.path.exists(temp):
    os.makedirs(temp)
    print 'temp dir created'
os.chdir(temp)
nx = ny = 1200
nof = 1000
print '\n*** Writing random data to files ***\n'
t1 = time(); t2 = 0; t3 = 0
for i in range(nof):
    if not i%10:
        print str(i),
    tt = time()
    data = np.array(np.random.rand(nx*ny), dtype=np.float32)
    t2 += time()-tt
    fn = '%d.bin' %i
    tt = time()
    f = open(fn, 'wb')
    f.write(data)
    f.close()  # note the parentheses: f.close alone does nothing
    t3 += time()-tt
print '\n*****************************'
print 'Total time: %f seconds' %(time()-t1)
print '%f seconds (on average) per random data production' %(t2/nof)
print '%f seconds (on average) per file write' %(t3/nof)
print '\n*** Reading random data from files ***\n'
t1 = time(); t3 = 0
for i,fn in enumerate(os.listdir('./')):
    if not i%10:
        print str(i),
    tt = time()
    f = open(fn, 'rb')
    data = np.fromfile(f, dtype=np.float32)  # match the dtype the files were written with
    f.close()
    t3 += time()-tt
print '\n*****************************'
print 'Total time: %f seconds' %(time()-t1)
print '%f seconds (on average) per file read' %(t3/(i+1))
# clean up:
for f in os.listdir('./'):
    os.remove(f)
os.chdir('../')
os.rmdir(temp)
The result:
temp dir created
*** Writing random data to files ***
0 10 20 30 ... 970 980 990
*****************************
Total time: 25.569716 seconds
0.017786 seconds (on average) per random data production
0.007727 seconds (on average) per file write
*** Reading random data from files ***
0 10 20 30 ... 970 980 990
*****************************
Total time: 2.596179 seconds
0.002568 seconds (on average) per file read
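For what it's worth, reading 5.76 MB in ~0.0026 seconds works out to roughly 2.2 GB/s, which supports @DrV's point: these reads are served from the RAM cache, not from the SSD.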