Tags: python, dataframe, python-polars

How to load multiple files with custom processing for each of them?


I have several CSV files with the same structure:

  • data_product_1.csv
  • data_product_2.csv
  • data_product_3.csv
  • etc.

It is clear to me that to obtain a DataFrame with all the data concatenated together with Polars, I can do something like:

import polars as pl
df = pl.read_csv("data_*.csv")

What I would like to do is add an extra column to the final DataFrame containing the name of the product, e.g.

data        value  product_code
2000-01-01      1  product_1
2000-01-02      2  product_1
2000-01-01      3  product_2
2000-01-02      4  product_2
2000-01-01      5  product_3

I'm aware that I can load the files one by one, add the extra column, and concatenate them together afterwards, but I was wondering if I'm missing some other way to take advantage of Polars' performance here.


Solution

  • It seems you want the filename added as a column, which DuckDB supports directly, e.g.

    duckdb.sql("""
    from read_csv_auto('data_*.csv', filename = true)
    """)
    
    ┌────────────┬───────┬────────────────────┐
    │    data    │ value │      filename      │
    │    date    │ int64 │      varchar       │
    ├────────────┼───────┼────────────────────┤
    │ 2000-01-01 │     1 │ data_product_1.csv │
    │ 2000-01-02 │     2 │ data_product_1.csv │
    │ 2000-01-01 │     3 │ data_product_2.csv │
    │ 2000-01-02 │     4 │ data_product_2.csv │
    │ 2000-01-01 │     4 │ data_product_3.csv │
    │ 2000-01-02 │     5 │ data_product_3.csv │
    └────────────┴───────┴────────────────────┘
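
    If you go the DuckDB route, the relation can be converted back into a Polars DataFrame, so the rest of the pipeline can stay in Polars. A minimal sketch (this assumes a reasonably recent duckdb; the conversion goes through Arrow, so pyarrow may also need to be installed):

    import duckdb

    # read all matching CSVs, tagging each row with its source filename,
    # then hand the result over to Polars as a DataFrame
    df = duckdb.sql("""
        from read_csv_auto('data_*.csv', filename = true)
    """).pl()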
    

    This has been requested a few times but has yet to be added to Polars: https://github.com/pola-rs/polars/issues/9096

    You can replace read_csv with scan_csv, which delays reading the file and returns a LazyFrame instead.

    The frames can be combined with concat; when the result is collected, the LazyFrame inputs are (by default) read and processed in parallel.

    import polars as pl
    from pathlib import Path
    
    # build one LazyFrame per file, tagging each with its source filename
    csvs = [
        pl.scan_csv(f).with_columns(product_code=pl.lit(f.name))
        for f in Path().glob("data_*.csv")
    ]
    
    # the inputs are read in parallel when collected
    df = pl.concat(csvs).collect()
    
    shape: (6, 3)
    ┌────────────┬───────┬────────────────────┐
    │ data       ┆ value ┆ product_code       │
    │ ---        ┆ ---   ┆ ---                │
    │ str        ┆ i64   ┆ str                │
    ╞════════════╪═══════╪════════════════════╡
    │ 2000-01-01 ┆ 1     ┆ data_product_1.csv │
    │ 2000-01-02 ┆ 2     ┆ data_product_1.csv │
    │ 2000-01-01 ┆ 3     ┆ data_product_2.csv │
    │ 2000-01-02 ┆ 4     ┆ data_product_2.csv │
    │ 2000-01-01 ┆ 4     ┆ data_product_3.csv │
    │ 2000-01-02 ┆ 5     ┆ data_product_3.csv │
    └────────────┴───────┴────────────────────┘
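
    If you want just the product name (e.g. product_1) rather than the full filename in product_code, one option is to derive it from the file name before building the literal column. A minimal sketch, assuming the data_<product>.csv naming used above (removeprefix requires Python 3.9+):

    import polars as pl
    from pathlib import Path
    
    csvs = [
        # Path.stem drops the ".csv" suffix; removeprefix() strips the leading
        # "data_", leaving e.g. "product_1" as the product_code value
        pl.scan_csv(f).with_columns(
            product_code=pl.lit(f.stem.removeprefix("data_"))
        )
        for f in Path().glob("data_*.csv")
    ]
    
    df = pl.concat(csvs).collect()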