This is the Python analogue of the following Spark question:
I am reading in a wildcard list of parquet files using (variously) pandas and awswrangler.
Is there a way to retrieve a column containing the original filename of each row loaded into the eventual combined dataframe, just as in the Spark version of this question?
Update: This is possibly a way to do it - Reading DataFrames saved as parquet with pyarrow, save filenames in columns
Update 2: The present question is the reverse of https://stackoverflow.com/a/59682461/1021819
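For reference, the approach in the linked answer boils down to something like the following, a minimal sketch using pyarrow's dataset API; the data/ directory is a placeholder, and the files are assumed to share a schema:

import pyarrow as pa
import pyarrow.dataset as ds

# treat the matching files as one dataset (placeholder directory)
dataset = ds.dataset("data/", format="parquet")

tables = []
for fragment in dataset.get_fragments():
    # each fragment corresponds to one underlying parquet file
    table = fragment.to_table()
    # tag every row with the path of the file it came from
    table = table.append_column(
        "filename", pa.array([fragment.path] * table.num_rows)
    )
    tables.append(table)

df = pa.concat_tables(tables).to_pandas()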
You'll need to add the filename as a new column to each dataframe as you load it. For example, here is how to do that with a set of CSV files, since those are easier to generate for a runnable example; you'd follow the same pattern for parquet files.
from pathlib import Path
import pandas as pd
# write a couple fake csv files to our disk as examples
Path("csv1.csv").write_text("a,b\n1,2\n1,2")
Path("csv2.csv").write_text("b,c\n3,4\n3,4")
all_dfs = []
# for every csv file that we want to load
for f in Path(".").glob("csv*.csv"):
    # read the csv
    df = pd.read_csv(f)
    # add a filename column to the dataframe
    df["filename"] = f.name
    # store the dataframe to concat later
    all_dfs.append(df)
# put together the dataframes for each file
pd.concat(all_dfs)
#> a b filename c
#> 0 1.0 2 csv1.csv NaN
#> 1 1.0 2 csv1.csv NaN
#> 0 NaN 3 csv2.csv 4.0
#> 1 NaN 3 csv2.csv 4.0
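The parquet version is the same loop with a different reader; a minimal sketch, assuming local files matching data*.parquet (hypothetical names):

from pathlib import Path
import pandas as pd

all_dfs = []
for f in Path(".").glob("data*.parquet"):
    df = pd.read_parquet(f)
    # same trick: record which file each row came from
    df["filename"] = f.name
    all_dfs.append(df)

combined = pd.concat(all_dfs)

And if the files live on S3 and you're using awswrangler, you can list the matching objects and read them one at a time; again a sketch, with a placeholder bucket and prefix:

import awswrangler as wr
import pandas as pd

all_dfs = []
# wr.s3.list_objects returns the full s3:// path of each object under the prefix
for path in wr.s3.list_objects("s3://my-bucket/my-prefix/"):
    if not path.endswith(".parquet"):
        continue
    df = wr.s3.read_parquet(path)
    df["filename"] = path
    all_dfs.append(df)

combined = pd.concat(all_dfs)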