Tags: python, pandas, dataframe, dask, glob

Combine big data stored in subdirectories as 100,000+ CSV files totaling 200 GB with Python


I want to create an algorithm to extract data from CSV files spread across different folders/subfolders. Each folder will have 9,000 CSVs, and there will be 12 such folders: 12 × 9,000, over 100,000 files in total.


Solution

  • This is a working solution for over 100,000 files.

    Credits: Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934

        import glob
        import time

        import pandas as pd

        start = time.time()

        # Collect every CSV directly under the data directory
        path = 'csv_test/data/'
        all_files = glob.glob(path + "*.csv")

        # Read each file into a DataFrame and keep them in a list
        frames = []
        for filename in all_files:
            df = pd.read_csv(filename, index_col=None, header=0)
            frames.append(df)

        # Concatenate everything row-wise and write one combined CSV
        frame = pd.concat(frames, axis=0, ignore_index=True)
        frame.to_csv('output.csv', index=False)

        end = time.time()
        print(end - start)

    Not sure whether it can handle 200 GB of data - feedback on this would be appreciated. An out-of-core sketch for that case is below.
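    Since pd.concat holds every DataFrame in memory at once, 200 GB is likely to exceed available RAM. A minimal out-of-core sketch using Dask (one of the question's tags) follows; the csv_test/data/ path and the recursive **/*.csv pattern for the 12 subfolders are assumptions for illustration, not part of the original answer.

        import glob

        import dask.dataframe as dd

        # Assumed layout: 12 subfolders of CSVs under csv_test/data/ (hypothetical path);
        # recursive=True makes glob descend into the subfolders as well
        files = glob.glob('csv_test/data/**/*.csv', recursive=True)

        # Dask reads the files lazily in partitions, so the full 200 GB never has
        # to fit in memory at once
        ddf = dd.read_csv(files)

        # single_file=True writes the partitions out as one combined CSV
        ddf.to_csv('output.csv', single_file=True)

    Another option that stays in pandas is to write each file to output.csv as it is read (mode='a', header only on the first file), which also avoids building the full frame in memory.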