Search code examples
pythonpandaspandas-melt

Pandas Getting first number of rows of each xls file


I have about (100 files +) XLS files in a folder with different columns names and data types.

File_1.xls:

Id  test  category
1    ab      4
2    cs      3
3    cs      1
..   ..

FILE_2.xls:

index remove  stocks  category
1       dr      4         a
2       as      3         b
3       ae      1         v
..

File 3: ....

File 100.....

This is my code:

import pandas as pd
import pathlib

    files = []
    for filename in pathlib.Path.cwd().iterdir():
        if filename.suffix.lower().startswith('.xls'):
            files.append(filename)
            
    df = (pd.concat({f: pd.read_excel(f).melt() for f in files})
            .rename_axis(['filename', None]).reset_index(0)
            .reset_index(drop=True)
            .iloc[:1]
         )

Output

   filename  variable value
0    File_1        Id     1
1    File_1        Id     2

Expected Results

   filename  variable value
0    File_1        Id     1
1    File_1        Id     2
2    File_1      test    ab
3    File_1      test    cs
4    File_1  category     4
5    File_1  category     3
6    File_2     index     1
7    File_2     index     2
8    File_2    remove    dr
9    File_2    remove    as
10   File_2    stocks     4
11   File_2    stocks     3
12   File_2  category     a
13   File_2  category     b
...

I want to get for all the files the first 2 rows, Melt, and export them in a dataframe.


Solution

  • try this

    import pathlib
    import pandas as pd
    
    files = []
    for filename in pathlib.Path.cwd().iterdir():
        if filename.suffix.lower().startswith('.xls'):
            files.append(filename)
    
    # DataFrame for final result
    empty_df = pd.DataFrame()
    
    # iterating over each file        
    for f in files:
        # read first 2 rows of DataFrame and melt
        temp = pd.melt(pd.read_excel(f).head(2))
        
        # add a new column that assigns the filename
        temp['filename'] = f
        
        # concat with the final result DataFrame
        empty_df = pd.concat([empty_df, temp])
    
    # if column order is relevant you may re-arrange them like so
    empty_df = empty_df[['filename', 'variable', 'value']]
    
    display(empty_df)