I have several CSVfiles with the same structure:
It is clear to me that to obtain a dataframe with all the data concatted together with polars, I can do something like:
import polars as pl
df = pl.read_csv("data_*.csv")
What I would like to do is to add an extra column to the final dataframe containing the name of the product, e.g.
data | value | product_code |
---|---|---|
2000-01-01 | 1 | product_1 |
2000-01-02 | 2 | product_1 |
2000-01-01 | 3 | product_2 |
2000-01-02 | 4 | product_2 |
2000-01-01 | 5 | product_3 |
I'm aware I can load the files one by one, add the extra column and concat them together afterwards but I was wondering if I'm missing some other way to take advantage of polars performances here.
It seems you're wanting the filename added as a column, e.g.
duckdb.sql("""
from read_csv_auto('data_*.csv', filename = true)
""")
┌────────────┬───────┬────────────────────┐
│ data │ value │ filename │
│ date │ int64 │ varchar │
├────────────┼───────┼────────────────────┤
│ 2000-01-01 │ 1 │ data_product_1.csv │
│ 2000-01-02 │ 2 │ data_product_1.csv │
│ 2000-01-01 │ 3 │ data_product_2.csv │
│ 2000-01-02 │ 4 │ data_product_2.csv │
│ 2000-01-01 │ 4 │ data_product_3.csv │
│ 2000-01-02 │ 5 │ data_product_3.csv │
└────────────┴───────┴────────────────────┘
This has been requested a few times but is yet to be added to Polars: https://github.com/pola-rs/polars/issues/9096
You can replace read_csv
with scan_csv
which delays reading the file and returns a LazyFrame instead.
The frames can be combined with concat which (by default) "computes" LazyFrames in parallel.
from pathlib import Path
# lazyframes
csvs = [
pl.scan_csv(f).with_columns(product_code=pl.lit(f.name))
for f in Path().glob("data_*.csv")
]
# inputs are read in parallel
df = pl.concat(csvs).collect()
shape: (6, 3)
┌────────────┬───────┬────────────────────┐
│ data ┆ value ┆ product_code │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════════════╡
│ 2000-01-01 ┆ 1 ┆ data_product_1.csv │
│ 2000-01-02 ┆ 2 ┆ data_product_1.csv │
│ 2000-01-01 ┆ 3 ┆ data_product_2.csv │
│ 2000-01-02 ┆ 4 ┆ data_product_2.csv │
│ 2000-01-01 ┆ 4 ┆ data_product_3.csv │
│ 2000-01-02 ┆ 5 ┆ data_product_3.csv │
└────────────┴───────┴────────────────────┘