Search code examples
pythonpandaspathlib

Melting Python Pandas and Input Filename in column


I have about (100 files +) XLS files in a folder with different columns names and data types

File_1.xls:

Id  test  category
1    ab      4
2    cs      3
3    cs      1

FILE_2.xls:

index remove  stocks  category
1       dr      4         a
2       as      3         b
3       ae      1         v

File 3: ....

File 100.....

Thats my code:

import pandas as pd
import pathlib

data = []
for filename in pathlib.Path.cwd().iterdir():
    if filename.suffix.lower().startswith('.xls'):
        data.append(pd.read_excel(filename).melt())
df = pd.concat(data, ignore_index=True)

I would like to have the Dataframe in my output like this:

        filename variable value
    0   File_1     Id      1
    1   File_1     Id      2
    2   File_1     Id      3
    3   File_1     test    ab
    4   File_1     test    cs
    5   File_1     test    cs
    6   File_1  category    4
    7   File_1  category    3
    8   File_1  category    1
    9   File_1   index      1
    10  File_1   index      2
    11  File_1   index      3
    12  FILE_2   remove     dr
    13  FILE_2  remove      as
    14  FILE_2  remove      ae
    15  FILE_2  stocks      4
    16  FILE_2  stocks      3
    17  FILE_2  stocks      1
    18  FILE_2  category    a
    19  FILE_2  category    b
    20  FILE_2  category    v
  1000  FILE_100 ....       ..

How I can melt all mu columns and keep in the column "filename" the name of the source file?


Solution

  • Don't read your data in the loop but rather collect the filenames, then use a dictionary comprehension to add the filenames as concatenation keys:

    import pandas as pd
    import pathlib
    
    files = []
    for filename in pathlib.Path.cwd().iterdir():
        if filename.suffix.lower().startswith('.xls'):
            files.append(filename)
            
    df = (pd.concat({f: pd.read_excel(f).melt() for f in files})
            .rename_axis(['filename', None]).reset_index(0)
            .reset_index(drop=True)
         )
    

    output:

       filename  variable value
    0    File_1        Id     1
    1    File_1        Id     2
    2    File_1        Id     3
    3    File_1      test    ab
    4    File_1      test    cs
    5    File_1      test    cs
    6    File_1  category     4
    7    File_1  category     3
    8    File_1  category     1
    9    File_2     index     1
    10   File_2     index     2
    11   File_2     index     3
    12   File_2    remove    dr
    13   File_2    remove    as
    14   File_2    remove    ae
    15   File_2    stocks     4
    16   File_2    stocks     3
    17   File_2    stocks     1
    18   File_2  category     a
    19   File_2  category     b
    20   File_2  category     v