Tags: python, pandas, dataframe, dask, glob

Combine big data stored in subdirectories as 100,000+ CSV files totaling 200 GB with Python


I want to create an algorithm to extract data from CSV files spread across different folders/subfolders. Each folder will have 9,000 CSVs, and there will be 12 such folders: 12 × 9,000, over 100,000 files in total.


Solution

  • This is a working solution for over 100,000 files.

    Credits: Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934

        import glob
        import time

        import pandas as pd

        start = time.time()

        # Collect every CSV directly under the data directory
        path = 'csv_test/data/'
        all_files = glob.glob(path + "*.csv")

        # Read each file into a DataFrame and keep them in a list
        frames = []
        for filename in all_files:
            df = pd.read_csv(filename, index_col=None, header=0)
            frames.append(df)

        # Concatenate everything row-wise and write one combined CSV
        frame = pd.concat(frames, axis=0, ignore_index=True)
        frame.to_csv('output.csv', index=False)

        end = time.time()
        print(end - start)

    Not sure whether it can handle 200 GB of data - feedback on this would be appreciated. An out-of-core sketch for that case is below.
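    Since pd.concat holds every DataFrame in memory at once, 200 GB is likely to exceed available RAM. A minimal out-of-core sketch using Dask (one of the question's tags) follows; the csv_test/data/ path and the recursive **/*.csv pattern for the 12 subfolders are assumptions for illustration, not part of the original answer.

        import glob

        import dask.dataframe as dd

        # Assumed layout: 12 subfolders of CSVs under csv_test/data/ (hypothetical path);
        # recursive=True makes glob descend into the subfolders as well
        files = glob.glob('csv_test/data/**/*.csv', recursive=True)

        # Dask reads the files lazily in partitions, so the full 200 GB never has
        # to fit in memory at once
        ddf = dd.read_csv(files)

        # single_file=True writes the partitions out as one combined CSV
        ddf.to_csv('output.csv', single_file=True)

    Another option that stays in pandas is to write each file to output.csv as it is read (mode='a', header only on the first file), which also avoids building the full frame in memory.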