I have a similar problem to how to select all columns from a list in a polars dataframe, but slightly different:
import polars as pl
import numpy as np
import string
rng = np.random.default_rng(42)
nr = 3
letters = list(string.ascii_letters)
uppercase = list(string.ascii_uppercase)
words, groups = [], []
for i in range(nr):
word = ''.join([rng.choice(letters) for _ in range(rng.integers(3, 20))])
words.append(word)
group = rng.choice(uppercase)
groups.append(group)
df = pl.DataFrame(
{
"a_0": np.linspace(0, 1, nr),
"a_1": np.linspace(1, 2, nr),
"a_2": np.linspace(2, 3, nr),
"b_0": np.random.rand(nr),
"b_1": 2 * np.random.rand(nr),
"b_2": 3 * np.random.rand(nr),
"words": words,
"groups": groups,
}
)
print(df)
shape: (3, 8)
┌─────┬─────┬─────┬──────────┬──────────┬──────────┬─────────────────┬────────┐
│ a_0 ┆ a_1 ┆ a_2 ┆ b_0 ┆ b_1 ┆ b_2 ┆ words ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪══════════╪══════════╪══════════╪═════════════════╪════════╡
│ 0.0 ┆ 1.0 ┆ 2.0 ┆ 0.653892 ┆ 0.234362 ┆ 0.880558 ┆ OIww ┆ W │
│ 0.5 ┆ 1.5 ┆ 2.5 ┆ 0.408888 ┆ 0.213767 ┆ 1.833025 ┆ KkeB ┆ Z │
│ 1.0 ┆ 2.0 ┆ 3.0 ┆ 0.423949 ┆ 0.646378 ┆ 0.116173 ┆ NLOAgRxAtjWOHuQ ┆ O │
└─────┴─────┴─────┴──────────┴──────────┴──────────┴─────────────────┴────────┘
I want again to concatenate the columns a_0
, a_1
,... into a column a
, columns b_0
, b_1
,... into a column b
. However, unlike the preceding question, this time a = [a_0; a_1; ...]
. I.e., all the elements of a_0
go first, followed by all the elements of a_1
, etc. All the columns whose name doesn't end with a _
followed by a digit (in this example, words
and groups
) must be duplicated enough times to match the length of a
. Let nr
and nc
be the number of rows/columns in df
. Then the output dataframe must have m*nr
rows (m=3
in this case) and nc-2*(m-1)
columns, i.e.
shape: (9, 4)
┌─────────────────┬────────┬─────┬──────────┐
│ words ┆ groups ┆ a ┆ b │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞═════════════════╪════════╪═════╪══════════╡
│ OIww ┆ W ┆ 0.0 ┆ 0.653892 │
│ KkeB ┆ Z ┆ 0.5 ┆ 0.408888 │
│ NLOAgRxAtjWOHuQ ┆ O ┆ 1.0 ┆ 0.423949 │
│ OIww ┆ W ┆ 1.0 ┆ 0.234362 │
│ KkeB ┆ Z ┆ 1.5 ┆ 0.213767 │
│ NLOAgRxAtjWOHuQ ┆ O ┆ 2.0 ┆ 0.646378 │
│ OIww ┆ W ┆ 2.0 ┆ 0.880558 │
│ KkeB ┆ Z ┆ 2.5 ┆ 1.833025 │
│ NLOAgRxAtjWOHuQ ┆ O ┆ 3.0 ┆ 0.116173 │
└─────────────────┴────────┴─────┴──────────┘
How can I do that?
You can extend this answer to your previous question by @jqurious to include index, such as words
and groups
, as follows:
(
df
.unpivot(index=["words", "groups"])
.with_columns(pl.col("variable").str.replace("_.*", ""))
.with_columns(index = pl.int_range(pl.len()).over("variable"))
.pivot(on="variable", index=["index", "words", "groups"], values="value")
.drop("index")
)
Explanation
Consider a simplified dataset with n = 2
and just three a_*
/ b_*
columns.
shape: (2, 8)
┌─────┬─────┬─────┬──────────┬──────────┬──────────┬──────────────────┬────────┐
│ a_0 ┆ a_1 ┆ a_2 ┆ b_0 ┆ b_1 ┆ b_2 ┆ words ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪══════════╪══════════╪══════════╪══════════════════╪════════╡
│ 0.0 ┆ 1.0 ┆ 2.0 ┆ 0.285304 ┆ 1.261851 ┆ 0.295949 ┆ VUvcCgzrycGaKSve ┆ I │
│ 1.0 ┆ 2.0 ┆ 3.0 ┆ 0.460023 ┆ 1.89468 ┆ 1.042234 ┆ GXFVckCws ┆ O │
└─────┴─────┴─────┴──────────┴──────────┴──────────┴──────────────────┴────────┘
Using index in the initial pl.DataFrame.unpivot
adds the content of the words
/ groups
columns to each row after the unpivot. We also ensure that the index column is created within each group defined by words
, groups
, and variable
.
(
df
.unpivot(index=["words", "groups"])
.with_columns(pl.col("variable").str.replace("_.*", ""))
.with_columns(index = pl.int_range(pl.len()).over("variable"))
.sort("words", "groups", "variable")
)
shape: (12, 5)
┌──────────────────┬────────┬──────────┬──────────┬───────┐
│ words ┆ groups ┆ variable ┆ value ┆ index │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ i64 │
╞══════════════════╪════════╪══════════╪══════════╪═══════╡
│ GXFVckCws ┆ O ┆ a ┆ 1.0 ┆ 0 │
│ GXFVckCws ┆ O ┆ a ┆ 2.0 ┆ 1 │
│ GXFVckCws ┆ O ┆ a ┆ 3.0 ┆ 2 │
│ GXFVckCws ┆ O ┆ b ┆ 0.460023 ┆ 0 │
│ GXFVckCws ┆ O ┆ b ┆ 1.89468 ┆ 1 │
│ GXFVckCws ┆ O ┆ b ┆ 1.042234 ┆ 2 │
│ VUvcCgzrycGaKSve ┆ I ┆ a ┆ 0.0 ┆ 0 │
│ VUvcCgzrycGaKSve ┆ I ┆ a ┆ 1.0 ┆ 1 │
│ VUvcCgzrycGaKSve ┆ I ┆ a ┆ 2.0 ┆ 2 │
│ VUvcCgzrycGaKSve ┆ I ┆ b ┆ 0.285304 ┆ 0 │
│ VUvcCgzrycGaKSve ┆ I ┆ b ┆ 1.261851 ┆ 1 │
│ VUvcCgzrycGaKSve ┆ I ┆ b ┆ 0.295949 ┆ 2 │
└──────────────────┴────────┴──────────┴──────────┴───────┘
Then, the final pivot does no longer group by just index
but also the words
and groups
columns.