Mutating cells in a large Polars (Python) dataframe with iter_rows yields segmentation fault

I have a large dataframe that looks like this:

df_large = pl.DataFrame({'x':['h1','h2','h2','h3'],
                         'y':[1,2,3,4], 
                         'ind1':['0/0','1/0','1/1','0/1'], 
                         'ind2':['0/1','0/2','1/1','0/0'] }).lazy()
df_large.collect()

|   x   |   y   | ind_1 | ind_2   | 
|_______|_______|_______|_________|
| "h1"  |   1   | "0/0" |  '0/1'  |       
| "h2"  |   2   | "1/0" |  '0/2'  | 
| "h2"  |   3   | "1/1" |  '1/1'  | 
| "h3"  |   4   | "0/1" |  '0/0'  |

df_large contains coordinates [x (str), y (int)] and string values for many individuals [ind_1,ind_2,...]. It is very large, so I have to read the CSV file as a lazy dataframe. Additionally, I have a small dataframe that looks like this:

df_rep = pl.DataFrame({'x':['h1','h2','h2'],
                       'y':[1,2,2], 
                       'ind':['ind1','ind1','ind2']})
df_rep

|   x   |   y   |  indvs  | 
|_______|_______|_________|
| "h1"  |   1   | "ind_1" |     
| "h2"  |   2   | "ind_1" |
| "h2"  |   2   | "ind_2" |

I need to mutate the values for the columns ind_k in df_large when they appears on df_rep.

I did the following code for that:

for row in df_rep.iter_rows():
    df_large = df_large.with_columns(
                    pl.when(pl.col('x') == row[0],
                       pl.col('y')    == row[1])
                       .then(pl.col(row[2]).str.replace_all('(.)/(.)','./.'))
                       .otherwise(pl.col(row[2]))
                      .alias(row[2])
                  )
df_large.collect()

|   x   |   y   | ind_1 | ind_2   | 
|_______|_______|_______|_________|
| "h1"  |   1   | "./." |  '0/1'  |       
| "h2"  |   2   | "./." |  './.'  | 
| "h2"  |   3   | "1/1" |  '1/1'  | 
| "h3"  |   4   | "0/1" |  '0/0'  |

This method, while slow, works for a subset of the larger dataset. However, Polars produces a segmentation fault when applied to the full dataset. I was hoping you could provide feedback on how to resolve this issue. An alternative method to achieve my goal without using iter_rows() would be ideal!

I am a beginner with Polars, and I would greatly appreciate any feedback. I've been stuck on this issue for some time now :(

Solution

If you reshape the small frame with .pivot()

df_rep.with_columns(value=True).pivot(on="ind", index=["x", "y"])

shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ x   ┆ y   ┆ ind1 ┆ ind2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ h1  ┆ 1   ┆ true ┆ null │
│ h2  ┆ 2   ┆ true ┆ true │
└─────┴─────┴──────┴──────┘

You could then match the rows with a left .join() and put then when/then logic into a single .with_columns() call.

index = ["x", "y"]
other = df_rep.with_columns(value=True).pivot(on="ind", index=index)
names = other.drop(index).columns

(df_large
  .join(other, on=index, how="left")
  .with_columns(
     pl.when(pl.col(f"{name}_right"))
       .then(pl.col(name).str.replace_all(r"(.)/(.)", "./."))
       .otherwise(pl.col(name))
       for name in names
  ) 
)

shape: (4, 6)
┌─────┬─────┬──────┬──────┬────────────┬────────────┐
│ x   ┆ y   ┆ ind1 ┆ ind2 ┆ ind1_right ┆ ind2_right │
│ --- ┆ --- ┆ ---  ┆ ---  ┆ ---        ┆ ---        │
│ str ┆ i64 ┆ str  ┆ str  ┆ bool       ┆ bool       │
╞═════╪═════╪══════╪══════╪════════════╪════════════╡
│ h1  ┆ 1   ┆ ./.  ┆ 0/1  ┆ true       ┆ null       │
│ h2  ┆ 2   ┆ ./.  ┆ ./.  ┆ true       ┆ true       │
│ h2  ┆ 3   ┆ 1/1  ┆ 1/1  ┆ null       ┆ null       │
│ h3  ┆ 4   ┆ 0/1  ┆ 0/0  ┆ null       ┆ null       │
└─────┴─────┴──────┴──────┴────────────┴────────────┘

You can then .drop() the names_right columns.