python, pandas, operating-system, glob

Using os and glob to Search and Concatenate .csv Files and pandas to Create a DataFrame


Problem

I have multiple directories, each with subdirectories. These subdirectories contain .csv files with numerical data in them. I want to use glob and os (not shell scripts) to search two specified directories, locate specific files, and concatenate them in the format described below.

dir1 contains subdir1 contains A.csv 
     contains subdir2 contains B.csv

dir2 contains subdir1 contains A.csv
     contains subdir2 contains B.csv

IN BOTH CASES

>>> cat A.csv
1
2
3
4
5
>>> cat B.csv
6
7
8
9
10

MY DESIRED BEHAVIOUR

Find A.csv in dir1 and A.csv in dir2, searching every folder and subdirectory, and then merge them. After the merge, create a pandas.DataFrame.

>>> python3 merge.py dir1 dir2 A.csv
# prints df created from out.csv
   x   y
0  1   1 
1  2   2 
2  3   3
3  4   4
4  5   5
>>> cat out.csv
1
2
3
4
5
1
2
3
4
5

ASK QUESTIONS IF NEEDED


Solution

  • You can use os.walk to walk through directories and glob.glob to search for *.csv files like so:

    from os import walk
    from os.path import join
    from glob import glob
    root_dir = '/some/path/to_a_directory/'
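    for current_dir, _, _ in walk(root_dir):
        # Search the directory currently being visited, not the top-level root_dir
        all_csv = glob(join(current_dir, '*.csv'))
        for fpath in all_csv:
            # Open the file and do something with it; printing the path is a placeholder
            print(fpath)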
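Building on the walk-and-glob loop above, here is a minimal sketch of what merge.py could look like. It is one possible reading of the question, not a definitive implementation. The helper names (find_files, read_lines, merge, to_frame) are made up for illustration, the column names x and y are taken from the desired output in the question, and the sketch assumes each file holds one value per line. The DataFrame is built from the values held in memory rather than by re-reading out.csv.

```python
import sys
from glob import glob
from os import walk
from os.path import join


def find_files(root_dir, filename):
    """Return the sorted paths of every file named `filename` under root_dir."""
    matches = []
    for current_dir, _, _ in walk(root_dir):
        matches.extend(glob(join(current_dir, filename)))
    return sorted(matches)


def read_lines(path):
    """Read the non-empty lines of a file, stripped of whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def merge(dirs, filename, out_path="out.csv"):
    """Append every match to out_path and return one list of values per directory."""
    columns = []
    with open(out_path, "w") as out:
        for d in dirs:
            values = []
            for path in find_files(d, filename):
                values.extend(read_lines(path))
            out.writelines(v + "\n" for v in values)
            columns.append(values)
    return columns


def to_frame(columns, names=("x", "y")):
    """Build a DataFrame with one column per searched directory (assumed layout)."""
    # Imported here so the search and concatenate helpers need only the stdlib.
    import pandas as pd

    data = {name: pd.to_numeric(pd.Series(col)) for name, col in zip(names, columns)}
    return pd.DataFrame(data)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python3 merge.py dir1 dir2 A.csv
    *dirs, filename = sys.argv[1:]
    print(to_frame(merge(dirs, filename)))
```

With the layout from the question, running python3 merge.py dir1 dir2 A.csv should write the ten concatenated lines to out.csv and print a two-column frame. If a directory contains several files with the same name, this sketch stacks them into that directory's column, which may or may not be what you want.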