Tags: python, pandas, hdf5, hdfstore

Too many open files in Windows when writing multiple HDF5 files


My question is: how do I make sure HDF5 files are properly closed after writing to them?

I am trying to save data to HDF5 files: there are around 200 folders, and each folder contains some data for each day of this year.

When I retrieve and save data using pandas' HDFStore with the following code in an IPython console, the function stops automatically after a while (no error message).

import pandas as pd

data = ...  # a pd.DataFrame
# Method 1
data.to_hdf('D:/file_001/2016-01-01.h5', 'type_1')
# Method 2
with pd.HDFStore('D:/file_001/2016-01-01.h5', 'a') as hf:
    hf['type_1'] = data

When I run the same script to download data again, it fails with:

[Errno 24] Too many open files: ...

Some posts suggest raising the limit with, for example, ulimit -n 1200 on Linux to overcome this problem, but unfortunately I'm on Windows.

Besides, I think I already close the files explicitly with a with block (a context manager), especially in Method 2. How come IPython still counts these files as open?

My loop is something like this:

univ = pd.read_excel(univ_file, univ_tab)
for dt in pd.date_range(start=start_date, end=end_date, freq='B'):  # business days
    for t in univ:
        data = download_data(t, dt)
        with pd.HDFStore(data_file, 'a') as hf:
            # Use pd.DataFrame([np.nan]) instead of pd.DataFrame() to save space
            hf[typ] = EMPTY_DF if data.shape[0] == 0 else data
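For reference, one way to reduce the open/close churn is to open each store once and reuse the handle across the inner loop. This is only a sketch, reusing the placeholder names from the snippet above (data_file, univ, download_data, typ, EMPTY_DF):

import pandas as pd

# Sketch: open the store once and reuse the handle, instead of
# re-opening it on every inner iteration. download_data, univ, typ
# and EMPTY_DF are placeholders carried over from the question.
with pd.HDFStore(data_file, 'a') as hf:
    for dt in pd.date_range(start=start_date, end=end_date, freq='B'):
        for t in univ:
            data = download_data(t, dt)
            hf[typ] = EMPTY_DF if data.shape[0] == 0 else data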

Solution

  • You can check / list all open files belonging to the Python process on Windows using the psutil module (the demo below assumes import os and import psutil have been run).

    Demo:

    In [52]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
    Out[52]:
    [[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
      popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite-journal', fd=-1),
      popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]
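
    A more direct equivalent for the current process, since psutil.Process() defaults to the current PID, is:

        import psutil

        # psutil.Process() with no argument refers to the running process,
        # so this returns the same list without scanning all processes.
        print(psutil.Process().open_files())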
    

    A file handle is closed as soon as we leave the following with block:

    In [53]: with pd.HDFStore('d:/temp/1.h5', 'a') as hf:
       ....:     hf['df2'] = df
       ....:
    

    Proof:

    In [54]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
    Out[54]:
    [[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
      popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]
    

    To check that psutil works properly at all, open a file manually (pay attention to D:\\temp\\aaa in the output):

    In [55]: fd = open('d:/temp/aaa', 'w')
    
    In [56]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
    Out[56]:
    [[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
      popenfile(path='D:\\temp\\aaa', fd=-1),
      popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]
    
    In [57]: fd.close()
    
    In [58]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
    Out[58]:
    [[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
      popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]
    

    So using this technique you can debug your code and find the place where the number of open files blows up in your case.
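
    For example, a small helper (a sketch; count_open_files is a name made up for illustration) can be dropped into the loop from the question to log how many files the process holds open on each iteration:

        import os
        import psutil

        def count_open_files():
            """Return the number of files the current process has open."""
            return len(psutil.Process(os.getpid()).open_files())

        # Call inside the loop, e.g. print(dt, t, count_open_files());
        # if the count keeps growing, the leak is in that iteration's body.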