python, pandas, dataframe, parquet, aws-data-wrangler

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in pandas/awswrangler?


This is the exact Python analogue of the following Spark question:

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

I am reading in a wildcard list of parquet files using (variously) pandas and awswrangler.

Is there a way to retrieve a column containing the original filename of each row loaded into the eventual combined dataframe, exactly as in the Spark version of this question?

Update: This may be one way to do it: Reading DataFrames saved as parquet with pyarrow, save filenames in columns

Update 2: The present question is the reverse of https://stackoverflow.com/a/59682461/1021819


Solution

  • You'll need to add the filename as a new column to each dataframe as you load it. For example, here is how to do this with a set of CSV files, since those are easier to run as an example; the same pattern applies to parquet files, as shown in the sketches after this example.

    from pathlib import Path
    
    import pandas as pd
    
    # write a couple fake csv files to our disk as examples
    Path("csv1.csv").write_text("a,b\n1,2\n1,2")
    Path("csv2.csv").write_text("b,c\n3,4\n3,4")
    
    all_dfs = []
    
    # for every csv file that we want to load
    for f in Path(".").glob("csv*.csv"):
        
        # read the csv
        df = pd.read_csv(f)
        
        # add a filename column to the dataframe
        df["filename"] = f.name
        
        # store the dataframe to concat later
        all_dfs.append(df)
        
    # put together the dataframes for each file
    pd.concat(all_dfs)
    #>      a  b  filename    c
    #> 0  1.0  2  csv1.csv  NaN
    #> 1  1.0  2  csv1.csv  NaN
    #> 0  NaN  3  csv2.csv  4.0
    #> 1  NaN  3  csv2.csv  4.0
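
    For parquet files the loop is the same, only the reader changes. A minimal sketch, assuming the files live locally and match the wildcard part*.parquet (the filenames here are placeholders):

    from pathlib import Path
    
    import pandas as pd
    
    all_dfs = []
    
    # for every parquet file matching the wildcard
    for f in Path(".").glob("part*.parquet"):
        
        # read the parquet file
        df = pd.read_parquet(f)
        
        # add a filename column to the dataframe
        df["filename"] = f.name
        
        # store the dataframe to concat later
        all_dfs.append(df)
        
    # put together the dataframes for each file
    combined = pd.concat(all_dfs)

    With awswrangler the idea is the same: expand the wildcard yourself by listing the objects under a prefix, then tag each dataframe as you read it. A sketch using wr.s3.list_objects and wr.s3.read_parquet; the bucket and prefix are placeholders:

    import awswrangler as wr
    import pandas as pd
    
    all_dfs = []
    
    # list every object under the prefix and keep only the parquet files
    for path in wr.s3.list_objects("s3://my-bucket/my-prefix/"):
        if not path.endswith(".parquet"):
            continue
        
        # read one parquet object at a time
        df = wr.s3.read_parquet(path)
        
        # record which S3 object each row came from
        df["filename"] = path
        
        # store the dataframe to concat later
        all_dfs.append(df)
        
    # put together the dataframes for each file
    combined = pd.concat(all_dfs)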