python, pandas, merge, large-files, python-itertools

Efficient merge for many huge CSV files


I have a script that takes all the CSV files in a directory and merges them side by side with an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I actually need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do the join there. Obviously I'd rather do this in Python if possible, because I'm already learning it. An R-based solution would also be acceptable.

Here is my code:

    import os
    import glob
    import pandas as pd
    os.chdir("\\path\\containing\\files")

    files = glob.glob("*.csv")
    sdf = pd.read_csv(files[0], sep=',')

    for filename in files[1:]:
        df = pd.read_csv(filename, sep=',')
        sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])

Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.


Solution

  • Use HDF5, which in my opinion would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.

    import os
    import glob
    import pandas as pd
    os.chdir("\\path\\containing\\files")
    
    files = glob.glob("*.csv")
    hdf_path = 'my_concatenated_file.h5'
    
    with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
        # complevel=5 with the blosc library compresses the output file.
        # You can drop or change these settings as per your needs.
        for filename in files:
            # Read each CSV in chunks so that no single file has to fit in
            # memory, and declare the key columns as data_columns so they can
            # be indexed and queried later. Tune chunksize to your RAM.
            for chunk in pd.read_csv(filename, sep=',', chunksize=500000):
                store.append('table_name', chunk, index=False,
                             data_columns=['Factor1', 'Factor2'])
        # Then create the indexes on the key columns, if you need them
        store.create_table_index('table_name', columns=['Factor1', 'Factor2'],
                                 optlevel=9, kind='full')
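
    Once the store is built, you can take advantage of the out-of-core querying
    mentioned above instead of loading the whole table back into memory. A
    minimal sketch, assuming 'Factor1' and 'Factor2' were declared as
    data_columns as in the loop above; the value 'A' and the chunksize are just
    placeholders:

    # Pull only the rows matching a condition, without reading the full table
    subset = pd.read_hdf(hdf_path, 'table_name', where="Factor1 == 'A'")

    # Or stream the table back in manageable pieces
    with pd.HDFStore(hdf_path, mode='r') as store:
        for chunk in store.select('table_name', chunksize=500000):
            pass  # process each chunk here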