I have a large dataframe that looks like this:
df_large = pl.DataFrame({'x':['h1','h2','h2','h3'],
'y':[1,2,3,4],
'ind1':['0/0','1/0','1/1','0/1'],
'ind2':['0/1','0/2','1/1','0/0'] }).lazy()
df_large.collect()
| x | y | ind_1 | ind_2 |
|_______|_______|_______|_________|
| "h1" | 1 | "0/0" | '0/1' |
| "h2" | 2 | "1/0" | '0/2' |
| "h2" | 3 | "1/1" | '1/1' |
| "h3" | 4 | "0/1" | '0/0' |
df_large contains coordinates [x (str), y (int)] and string values for many individuals [ind_1,ind_2,...]. It is very large, so I have to read the CSV file as a lazy dataframe. Additionally, I have a small dataframe that looks like this:
df_rep = pl.DataFrame({'x':['h1','h2','h2'],
'y':[1,2,2],
'ind':['ind1','ind1','ind2']})
df_rep
| x | y | indvs |
|_______|_______|_________|
| "h1" | 1 | "ind_1" |
| "h2" | 2 | "ind_1" |
| "h2" | 2 | "ind_2" |
I need to mutate the values for the columns ind_k in df_large when they appears on df_rep.
I did the following code for that:
for row in df_rep.iter_rows():
df_large = df_large.with_columns(
pl.when(pl.col('x') == row[0],
pl.col('y') == row[1])
.then(pl.col(row[2]).str.replace_all('(.)/(.)','./.'))
.otherwise(pl.col(row[2]))
.alias(row[2])
)
df_large.collect()
| x | y | ind_1 | ind_2 |
|_______|_______|_______|_________|
| "h1" | 1 | "./." | '0/1' |
| "h2" | 2 | "./." | './.' |
| "h2" | 3 | "1/1" | '1/1' |
| "h3" | 4 | "0/1" | '0/0' |
This method, while slow, works for a subset of the larger dataset. However, Polars produces a segmentation fault when applied to the full dataset. I was hoping you could provide feedback on how to resolve this issue. An alternative method to achieve my goal without using iter_rows() would be ideal!
I am a beginner with Polars, and I would greatly appreciate any feedback. I've been stuck on this issue for some time now :(
If you reshape the small frame with .pivot()
df_rep.with_columns(value=True).pivot(on="ind", index=["x", "y"])
shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ x ┆ y ┆ ind1 ┆ ind2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ h1 ┆ 1 ┆ true ┆ null │
│ h2 ┆ 2 ┆ true ┆ true │
└─────┴─────┴──────┴──────┘
You could then match the rows with a left .join()
and put then when/then logic into a single .with_columns()
call.
index = ["x", "y"]
other = df_rep.with_columns(value=True).pivot(on="ind", index=index)
names = other.drop(index).columns
(df_large
.join(other, on=index, how="left")
.with_columns(
pl.when(pl.col(f"{name}_right"))
.then(pl.col(name).str.replace_all(r"(.)/(.)", "./."))
.otherwise(pl.col(name))
for name in names
)
)
shape: (4, 6)
┌─────┬─────┬──────┬──────┬────────────┬────────────┐
│ x ┆ y ┆ ind1 ┆ ind2 ┆ ind1_right ┆ ind2_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╪════════════╪════════════╡
│ h1 ┆ 1 ┆ ./. ┆ 0/1 ┆ true ┆ null │
│ h2 ┆ 2 ┆ ./. ┆ ./. ┆ true ┆ true │
│ h2 ┆ 3 ┆ 1/1 ┆ 1/1 ┆ null ┆ null │
│ h3 ┆ 4 ┆ 0/1 ┆ 0/0 ┆ null ┆ null │
└─────┴─────┴──────┴──────┴────────────┴────────────┘
You can then .drop()
the names_right
columns.