Search code examples
pythonpandasdataframeglob

Retrieving data from multiple files into multiple dataframes


Scenario: I have a list of files in a folder (including the file paths). I am trying to get the content of each of those files into a dataframe (one for each file), then further perform some operations and later merge these dataframes.

From various other questions in SO, I found multiple ways to iterate over the files in a folder and get the data, but all of those I found usually ready the files in a loop and concatenate them into a single dataframe automatically, which does not work for me.

For example:

import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls*']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)

or

import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xls*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)

The only piece of code I could put together from what I found is:

from os.path import isfile, join
import glob
mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = (pd.read_excel(f) for f in listoffiles)

This lines run without error, but they appear not to do anything, no variables or created nor changed.

Question: What am I doing wrong here? Is there a better way to do this?


Solution

  • You are really close, need join all data by concat from generator:

    contentdataframes = (pd.read_excel(f) for f in listoffiles)
    df = pd.concat(contentdataframes, ignore_index=True)
    

    If need list of DataFrames:

    contentdataframes = [pd.read_excel(f) for f in listoffiles]