Manipulating DataFrames in different subdirectories


I have many subdirectories, each containing a unique dataset. I want to apply the same manipulation to each data frame individually: enter a subdirectory, do the manipulation, move on to the next subdirectory and repeat. For illustrative purposes, here is some code:

import os

import pandas as pd

# exist_ok=True lets the script be re-run without a FileExistsError
os.makedirs('folder1', exist_ok=True)

d = {'column1': ['a', 'a', 'b', 'b', 'c'], 'column2': [10, 8, 6, 4, 2], 'column3': [1, 2, 3, 4, 5]}
test_a = pd.DataFrame(data=d)
# index=False keeps the row index out of the CSV file
test_a.to_csv('folder1/test_a.csv', index=False)

os.makedirs('folder2', exist_ok=True)
g = {'column1': ['a', 'a', 'b', 'b', 'c'], 'column2': [10, 8, 6, 4, 2], 'column3': [1, 2, 3, 4, 5]}
test_b = pd.DataFrame(data=g)
test_b.to_csv('folder2/test_b.csv', index=False)

The code above creates two subdirectories and saves an example data frame as a CSV file in each of them.

Let's say I want to achieve the following:

Group (count) each dataset in each folder by column1, and save the result in the corresponding subdirectory as a separate data frame. Preferably each data frame would be identified by the start of its file name (test in this case) rather than by its extension (csv).
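
To make the per-file step concrete, here is a minimal sketch of a count by column1 on the example data. It rebuilds the example frame in memory instead of reading it from disk, and the name `counts` is only illustrative.

```python
import pandas as pd

# Same example data as above, rebuilt in memory (no files needed).
d = {'column1': ['a', 'a', 'b', 'b', 'c'],
     'column2': [10, 8, 6, 4, 2],
     'column3': [1, 2, 3, 4, 5]}
test_a = pd.DataFrame(data=d)

# Number of rows per value of column1; reset_index turns the result
# back into an ordinary data frame with columns 'column1' and 'count'.
counts = test_a.groupby('column1').size().reset_index(name='count')
print(counts)
```

Here `size()` counts rows per group; `count()` would instead report the non-null entries of every other column.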

I can write the general function that groups the datasets, but I don't know how to access each subdirectory (probably with a for loop and the os or glob package).

Thanks in advance.


Solution

  • Use pathlib:

    import pandas as pd
    import pathlib
    
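    # directory that contains the subdirectories with the data files
    data_dir = pathlib.Path('data')
    
    # sorted() collects all matches before any output file is written
    for csvfile in sorted(data_dir.glob('**/*.csv')):
        if csvfile.stem.endswith('_grp'):
            continue  # skip results saved by an earlier run
        print(f"Processing '{csvfile.name}' in '{csvfile.parent}'")
        df = pd.read_csv(csvfile)
        # do stuff here
        out = df.groupby('column1').count()  # count as asked; use mean or whatever you want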
        out.to_csv(csvfile.parent / f"{csvfile.stem}_grp.csv")
        print(f"Saved as '{csvfile.stem}_grp.csv' in '{csvfile.parent}'")
        print()
    

    Output:

    Processing 'test_a.csv' in 'data/folder1'
    Saved as 'test_a_grp.csv' in 'data/folder1'
    
    Processing 'test_b.csv' in 'data/folder2'
    Saved as 'test_b_grp.csv' in 'data/folder2'
    

    Directory tree:

    data
    ├── folder1
    │   ├── test_a.csv
    │   └── test_a_grp.csv
    └── folder2
        ├── test_b.csv
        └── test_b_grp.csv
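
Since the question mentions os/glob, the same loop can also be sketched with the standard `glob` module. This is only a sketch under assumptions: the function name `count_by_column1`, the `_grp` suffix check, and the temporary demo directory are illustrative choices and not part of the answer above.

```python
import glob
import os
import tempfile

import pandas as pd


def count_by_column1(root):
    """Write <name>_grp.csv with row counts per column1 next to every CSV under root."""
    written = []
    # recursive=True lets '**' match nested subdirectories;
    # sorted() fixes the list of inputs before anything is written.
    pattern = os.path.join(root, '**', '*.csv')
    for path in sorted(glob.glob(pattern, recursive=True)):
        folder, filename = os.path.split(path)
        stem, _ext = os.path.splitext(filename)
        if stem.endswith('_grp'):
            continue  # skip results saved by an earlier run
        df = pd.read_csv(path)
        out = df.groupby('column1').size().reset_index(name='count')
        out_path = os.path.join(folder, f'{stem}_grp.csv')
        out.to_csv(out_path, index=False)
        written.append(out_path)
    return written


# Demo on a throwaway directory laid out like the question's example.
demo_root = tempfile.mkdtemp()
example = pd.DataFrame({'column1': ['a', 'a', 'b', 'b', 'c'],
                        'column2': [10, 8, 6, 4, 2],
                        'column3': [1, 2, 3, 4, 5]})
for folder, name in [('folder1', 'test_a'), ('folder2', 'test_b')]:
    os.makedirs(os.path.join(demo_root, folder))
    example.to_csv(os.path.join(demo_root, folder, f'{name}.csv'), index=False)

results = count_by_column1(demo_root)
print([os.path.relpath(p, demo_root) for p in results])
```

Returning the list of written paths makes it easy to check what was produced, and calling the function a second time overwrites the same `_grp` files instead of grouping them again.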