Mutating cells in a large Polars (Python) dataframe with iter_rows yields segmentation fault


I have a large dataframe that looks like this:

import polars as pl

df_large = pl.DataFrame({'x':['h1','h2','h2','h3'],
                         'y':[1,2,3,4],
                         'ind1':['0/0','1/0','1/1','0/1'],
                         'ind2':['0/1','0/2','1/1','0/0'] }).lazy()
df_large.collect()
|   x   |   y   | ind1  | ind2  |
|_______|_______|_______|_______|
| "h1"  |   1   | "0/0" | "0/1" |
| "h2"  |   2   | "1/0" | "0/2" |
| "h2"  |   3   | "1/1" | "1/1" |
| "h3"  |   4   | "0/1" | "0/0" |

df_large contains coordinates [x (str), y (int)] and string values for many individuals [ind1, ind2, ...]. It is very large, so I have to read the CSV file as a lazy dataframe. Additionally, I have a small dataframe that looks like this:

df_rep = pl.DataFrame({'x':['h1','h2','h2'],
                       'y':[1,2,2], 
                       'ind':['ind1','ind1','ind2']})
df_rep
|   x   |   y   |  ind   |
|_______|_______|________|
| "h1"  |   1   | "ind1" |
| "h2"  |   2   | "ind1" |
| "h2"  |   2   | "ind2" |

I need to mutate the values in the individual columns (ind1, ind2, ...) of df_large for every (x, y, ind) combination that appears in df_rep.

I wrote the following code for that:

for row in df_rep.iter_rows():
    # row is an (x, y, ind) tuple taken from df_rep
    df_large = df_large.with_columns(
        pl.when(pl.col('x') == row[0],
                pl.col('y') == row[1])
          .then(pl.col(row[2]).str.replace_all(r'(.)/(.)', './.'))
          .otherwise(pl.col(row[2]))
          .alias(row[2])
    )
df_large.collect()
|   x   |   y   | ind1  | ind2  |
|_______|_______|_______|_______|
| "h1"  |   1   | "./." | "0/1" |
| "h2"  |   2   | "./." | "./." |
| "h2"  |   3   | "1/1" | "1/1" |
| "h3"  |   4   | "0/1" | "0/0" |

This method, while slow, works for a subset of the larger dataset. However, Polars produces a segmentation fault when applied to the full dataset. I was hoping you could provide feedback on how to resolve this issue. An alternative method to achieve my goal without using iter_rows() would be ideal!

I am a beginner with Polars, and I would greatly appreciate any feedback. I've been stuck on this issue for some time now :(


Solution

  • If you reshape the small frame with .pivot(), you get one boolean flag column per individual:

    df_rep.with_columns(value=True).pivot(on="ind", index=["x", "y"])
    
    shape: (2, 4)
    ┌─────┬─────┬──────┬──────┐
    │ x   ┆ y   ┆ ind1 ┆ ind2 │
    │ --- ┆ --- ┆ ---  ┆ ---  │
    │ str ┆ i64 ┆ bool ┆ bool │
    ╞═════╪═════╪══════╪══════╡
    │ h1  ┆ 1   ┆ true ┆ null │
    │ h2  ┆ 2   ┆ true ┆ true │
    └─────┴─────┴──────┴──────┘
    

    You could then match the rows with a left .join() and put the when/then logic into a single .with_columns() call.

    index = ["x", "y"]
    other = df_rep.with_columns(value=True).pivot(on="ind", index=index)
    names = other.drop(index).columns
    
    (df_large
      .join(other.lazy(), on=index, how="left")  # df_large is lazy, so the pivoted frame must be too
      .with_columns(
         pl.when(pl.col(f"{name}_right"))
           .then(pl.col(name).str.replace_all(r"(.)/(.)", "./."))
           .otherwise(pl.col(name))
           for name in names
      )
      .collect()
    )
    
    shape: (4, 6)
    ┌─────┬─────┬──────┬──────┬────────────┬────────────┐
    │ x   ┆ y   ┆ ind1 ┆ ind2 ┆ ind1_right ┆ ind2_right │
    │ --- ┆ --- ┆ ---  ┆ ---  ┆ ---        ┆ ---        │
    │ str ┆ i64 ┆ str  ┆ str  ┆ bool       ┆ bool       │
    ╞═════╪═════╪══════╪══════╪════════════╪════════════╡
    │ h1  ┆ 1   ┆ ./.  ┆ 0/1  ┆ true       ┆ null       │
    │ h2  ┆ 2   ┆ ./.  ┆ ./.  ┆ true       ┆ true       │
    │ h2  ┆ 3   ┆ 1/1  ┆ 1/1  ┆ null       ┆ null       │
    │ h3  ┆ 4   ┆ 0/1  ┆ 0/0  ┆ null       ┆ null       │
    └─────┴─────┴──────┴──────┴────────────┴────────────┘
    

    You can then .drop() the helper columns that end in _right (one per entry in names).