python pandas dataframe io concatenation

Read from 2 different directories into a pandas dataframe while creating a relationship


I have two directories. One contains images, and the other contains masks. Each image in the images folder has a mask with the same filename in the masks folder. I want to create a pandas dataframe whose first column lists the locations of the images and whose second column lists the locations of the corresponding masks. As a first attempt, I wrote the following code:

import os
import pandas as pd

# Generate a list of all the image files and their masks
def generate_list(images, masks):

    images_df = pd.concat([pd.DataFrame([file],
                                        columns=['images']) for file in os.listdir(images)], ignore_index=True)
    masks_df = pd.concat([pd.DataFrame([file],
                                       columns=['masks']) for file in os.listdir(masks)], ignore_index=True)

    df = pd.concat([images_df, masks_df], axis=0, ignore_index=True)

    print(df)

    return df

However, I get the output:

       images     masks
0    47_1.bmp       NaN
1     5_1.bmp       NaN
2    26_1.bmp       NaN
3    24_1.bmp       NaN
4     7_1.bmp       NaN
5    19_1.bmp       NaN
6      19.bmp       NaN
7      18.bmp       NaN
8    45_1.bmp       NaN 
26    4_1.bmp       NaN
..        ...       ...
131       NaN    14.bmp
132       NaN  50_1.bmp
133       NaN  15_1.bmp
134       NaN  28_1.bmp
135       NaN   9_1.bmp
136       NaN    16.bmp
137       NaN  17_1.bmp
138       NaN    17.bmp
139       NaN  33_1.bmp

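Clearly, os.listdir already returns the files in arbitrary order, so the list is effectively shuffled before it goes into the concat operation. The bigger problem is that the images and masks end up in separate rows, with NaN in the other column.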

How would I go about doing this?


Solution

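import os
import pandas as pd

def generate_list(images, masks):
    # Sort both listings so that row i holds the same filename in each column;
    # os.listdir returns entries in arbitrary order.
    images_df = pd.DataFrame({'images': [os.path.join(images, file) for file in sorted(os.listdir(images))]})
    masks_df = pd.DataFrame({'masks': [os.path.join(masks, file) for file in sorted(os.listdir(masks))]})

    # axis=1 places the two frames side by side instead of stacking them
    df = pd.concat([images_df, masks_df], axis=1)

    # Shuffle the rows; each image stays on the same row as its mask
    return df.sample(frac=1)

Here is my new answer. The axis was wrong! With axis=0 the two frames are stacked on top of each other, which is why every row had a NaN; axis=1 places them side by side. Both listings are sorted because os.listdir returns entries in arbitrary order, and os.path.join builds the paths so the directory arguments do not need a trailing slash.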
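
Concatenating along axis=1 pairs rows by position, so it only gives correct pairs when both directories list exactly the same filenames in the same order. As a minimal sketch of an alternative, assuming the image and mask really do share filenames as the question states, the pairing can be made explicit by matching on the filename. The function name pair_by_filename is hypothetical and not part of the original answer.

```python
import os
import pandas as pd


def pair_by_filename(images, masks):
    """Hypothetical helper: one row per filename found in both directories."""
    # Keep only the filenames present in both directories, in a stable order
    common = sorted(set(os.listdir(images)) & set(os.listdir(masks)))

    # Build both columns from the same list so each row is a matching pair
    return pd.DataFrame({
        "images": [os.path.join(images, name) for name in common],
        "masks": [os.path.join(masks, name) for name in common],
    })
```

Files that exist in only one of the two directories are silently dropped here; whether that is the right behaviour depends on the dataset. If a shuffled frame is wanted, calling .sample(frac=1) on the result keeps each image on the same row as its mask.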