I'm trying out polars (Rust) on the Kaggle Titanic dataset (https://www.kaggle.com/competitions/titanic/data), which has a column called "Cabin" that contains null values.
I've been trying to use fill_null to set those nulls to the mode of the column, but it doesn't seem to change anything.
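use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Lazily scan the CSV, then materialize it into a DataFrame.
    let q = LazyCsvReader::new("data/train.csv")
        .has_header(true)
        .finish()?;
    let df = q.collect()?;

    // Attempt to fill the nulls in "Cabin" with the column's mode.
    let fill = df
        .clone()
        .lazy()
        .with_columns([col("Cabin").fill_null(col("Cabin").mode())])
        .collect()?;

    // Compare null counts before and after the fill.
    println!("{:?}", df.null_count());
    println!("{:?}", fill.null_count());
    Ok(())
}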
The output of that is:
shape: (1, 12)
┌─────────────┬──────────┬────────┬──────┬───┬────────┬──────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name ┆ … ┆ Ticket ┆ Fare ┆ Cabin ┆ Embarked │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════════╪══════════╪════════╪══════╪═══╪════════╪══════╪═══════╪══════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 687 ┆ 2 │
└─────────────┴──────────┴────────┴──────┴───┴────────┴──────┴───────┴──────────┘
shape: (1, 12)
┌─────────────┬──────────┬────────┬──────┬───┬────────┬──────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name ┆ … ┆ Ticket ┆ Fare ┆ Cabin ┆ Embarked │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════════╪══════════╪════════╪══════╪═══╪════════╪══════╪═══════╪══════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 687 ┆ 2 │
└─────────────┴──────────┴────────┴──────┴───┴────────┴──────┴───────┴──────────┘
Am I missing something here?
If you check the CSV, the most common value in "Cabin" is null (687 rows, per your null_count output), and mode() appears to count null as "the most occurring value" like any other.
So what seems to be happening is that you're asking it to replace every null with null.
Try filling with something other than the raw mode, or filter out the nulls first and take the mode of that result, and you should see it replace the values.
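To make that concrete, here is a small sketch in plain Rust (standard library only, not polars) that models the column as a Vec of Option values. The helper names mode_with_nulls, mode_without_nulls, fill_null and null_count are made up for this illustration, and the data is a toy stand-in for "Cabin". It only shows the logic: if null is itself the mode, filling nulls with the mode is a no-op, while dropping the nulls before taking the mode gives a usable fill value.

```rust
use std::collections::HashMap;

/// Most frequent entry of a column, counting `None` (null) as a value.
/// Ties are broken by first appearance so the result is deterministic.
/// Returns `None` when null is the mode (or the column is empty).
fn mode_with_nulls<'a>(col: &[Option<&'a str>]) -> Option<&'a str> {
    let mut counts: HashMap<Option<&'a str>, usize> = HashMap::new();
    for v in col {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let mut best: Option<(Option<&'a str>, usize)> = None;
    for v in col {
        let c = counts[v];
        if best.map_or(true, |(_, best_count)| c > best_count) {
            best = Some((*v, c));
        }
    }
    best.and_then(|(v, _)| v)
}

/// Most frequent non-null entry: drop the nulls first, then take the mode.
fn mode_without_nulls<'a>(col: &[Option<&'a str>]) -> Option<&'a str> {
    let non_null: Vec<Option<&'a str>> =
        col.iter().copied().filter(|v| v.is_some()).collect();
    mode_with_nulls(&non_null)
}

/// Replace every null with `fill`; a `None` fill leaves the column unchanged.
fn fill_null<'a>(col: &[Option<&'a str>], fill: Option<&'a str>) -> Vec<Option<&'a str>> {
    col.iter().map(|&v| v.or(fill)).collect()
}

/// Number of nulls in the column.
fn null_count(col: &[Option<&str>]) -> usize {
    col.iter().filter(|v| v.is_none()).count()
}

fn main() {
    // Toy stand-in for "Cabin": null is the most frequent entry.
    let cabin = vec![None, Some("C85"), None, Some("E46"), None, Some("C85")];

    let naive = fill_null(&cabin, mode_with_nulls(&cabin));
    let fixed = fill_null(&cabin, mode_without_nulls(&cabin));

    println!("nulls before:               {}", null_count(&cabin));
    println!("nulls after null-mode fill: {}", null_count(&naive));
    println!("nulls after non-null fill:  {}", null_count(&fixed));
}
```

In polars itself, the equivalent is presumably something like col("Cabin").drop_nulls().mode().first() as the fill expression. I have not run that against this dataset, so treat it as an assumption. The first() is there because mode can return more than one value when there is a tie, and fill_null needs a single fill value.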