I need to process over 10 million spectroscopic data sets. The data are structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the reads are not duplicates, because each time I extract a different segment of the same spectrum.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also tells me which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).
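For concreteness, the two input files are plain-text tables read line by line in parallel; hypothetical lines (the values here are made up purely for illustration) might look like:

spectra_ID.dat   (n1  n2  n3)
3586  55181  412
3586  55181  87

catalog.txt      (x  y  r  s)
150.234  2.117  1.8  0.42
149.871  2.590  2.3  0.17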
The code I have now is:
import numpy as np
from itertools import izip
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []

# Read the two catalog files in parallel: each line of spectra_ID.dat gives
# (n1, n2, n3) and the matching line of catalog.txt gives (x, y, r, s).
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(parts1[0])
        n2.append(parts1[1])
        n3.append(int(parts1[2]))   # n3 is used as a row index, so store it as an integer
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))
def data_analysis(idx_start, idx_end):   #### loop over 10 million entries
    data_stru = np.zeros((idx_end - idx_start),
                         dtype=[('spec', 'f4', (200,)), ('x', 'f8'), ('y', 'f8'),
                                ('r', 'f8'), ('s', 'f8')])
    for i in xrange(idx_start, idx_end):
        filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000]   #### returns a 2-D array with a single row
        flux = fluxx[0]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10, logwave)
        ##### After I read the flux and the wavegrid, I can do my analysis.
        ##### save data to data_stru
        ##### Reading is the most time-consuming part of this code; my later analysis is not.
The problem is that the files are too big, there is not enough memory to load them all at once, and my catalog is not structured such that all entries which will open the same file are grouped together. Can anyone offer some thoughts on how to split the large loop into two loops: 1) first loop over the files, so that we avoid repeatedly opening/reading the same files again and again, and 2) then loop over the entries which are going to use the same file?
If I understand your code correctly, n1 and n2 determine which file to open. So why not just lexsort them? You can then use itertools.groupby to group records with the same n1, n2. Here is a down-scaled proof of concept:
import itertools
import numpy as np

n1 = np.random.randint(0, 3, (10,))
n2 = np.random.randint(0, 3, (10,))
mockdata = np.arange(10) + 100

s = np.lexsort((n2, n1))
for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
    # groupby groups the iterations i of its first argument
    # (zip(...) in this case) by the result of applying the
    # optional second argument (here the lambda) to i.
    # Here we use the lambda expression to remove si from the
    # tuple (si, n1si, n2si) that zip produces, because otherwise
    # equal (n1si, n2si) pairs would still be treated as different
    # due to the distinct si's, and no grouping would occur.
    # Putting si in there in the first place is necessary so we
    # can reference the other records of the corresponding row
    # in the inner loop.
    print(k)
    for si, n1s, n2s in g:
        # si can be used to access the corresponding other records
        print(si, mockdata[si])
Prints something like:
(0, 1)
4 104
(0, 2)
0 100
2 102
6 106
(1, 0)
1 101
(2, 0)
8 108
9 109
(2, 1)
3 103
5 105
7 107
You may want to include n3 in the lexsort, but not in the grouping, so you can process the files' content in order.