Cargo.toml:
[dependencies]
polars = { version = "0.27.2", features = ["lazy"] }
I would expect that any two LazyFrames could be vertically concatenated as long as the columns they have in common have the same or promotable dtypes, with missing columns filled in with nulls (as pandas does). But evidently they need to have the same columns:
use polars::lazy::dsl::*;
use polars::prelude::{concat, df, DataType, IntoLazy, NamedFrom, NULL};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "y" intentionally comes before "x" here
    let df1 = df!["y" => &[1, 5, 17], "x" => &[1, 2, 3]].unwrap().lazy();
    let df2 = df!["x" => &[4, 5]].unwrap().lazy();
    println!(
        "{:?}",
        concat(&[df1, df2], true, true).unwrap().collect()?
    );
    Ok(())
}
This errors with Error: ShapeMisMatch(Owned("Could not vertically stack DataFrame. The DataFrames appended width 2 differs from the parent DataFrames width 1")).
I tried adding the missing "y" column to df2:
// everything but this line is the same as above
let df2 = df!["x" => &[4, 5]]
    .unwrap()
    .lazy()
    .with_column(lit(NULL).cast(DataType::Int32).alias("y"));
They have the same columns (albeit in different orders) and dtypes now:
shape: (3, 2)
┌─────┬─────┐
│ y ┆ x │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 1 ┆ 1 │
│ 5 ┆ 2 │
│ 17 ┆ 3 │
└─────┴─────┘
shape: (2, 2)
┌─────┬──────┐
│ x ┆ y │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪══════╡
│ 4 ┆ null │
│ 5 ┆ null │
└─────┴──────┘
But they still can't be concatenated. Trying to do so gives the error Error: SchemaMisMatch(Owned("cannot vstack: because column names in the two DataFrames do not match for left.name='y' != right.name='x'")). Evidently concat() requires that the columns be in the same order in the underlying DataFrames.
But I don't think it's possible to enforce any particular column order in LazyFrames (and it really shouldn't need to be because column order is supposed to be immaterial). So, what would be the best way to vertically concatenate these two LazyFrames?
If possible, I'd prefer not to .collect() them each into DataFrames, vstack the DataFrames, and then call .lazy() on the result; that seems needlessly complicated. And if I did .collect() them, I still wouldn't want to have to put the columns of the two DataFrames in the same order before stacking.
Edit:
After digging through the source it's pretty clear that this just isn't implemented. This ultimately gets compiled into a call to DataFrame::vstack_mut, which does not support missing or differently-ordered columns:
pub fn vstack_mut(&mut self, other: &DataFrame) -> PolarsResult<&mut Self> {
    if self.width() != other.width() {
        if self.width() == 0 {
            self.columns = other.columns.clone();
            return Ok(self);
        }
        return Err(PolarsError::ShapeMisMatch(
            format!("Could not vertically stack DataFrame. The DataFrames appended width {} differs from the parent DataFrames width {}", self.width(), other.width()).into()
        ));
    }
    self.columns
        .iter_mut()
        .zip(other.columns.iter())
        .try_for_each::<_, PolarsResult<_>>(|(left, right)| {
            can_extend(left, right)?;
            left.append(right).expect("should not fail");
            Ok(())
        })?;
    Ok(self)
}
Well, it turns out the answer is rather simple once you know where to look. Enabling the diagonal_concat feature unlocks diag_concat_lf (and the eager diag_concat_df):
pub fn diag_concat_lf<L>(lfs: L, rechunk: bool, parallel: bool) -> PolarsResult<LazyFrame>
where
L: AsRef<[LazyFrame]>,
{ ... }
pub fn diag_concat_df(dfs: &[DataFrame]) -> PolarsResult<DataFrame> { ... }