I would like to randomly select a few columns of a two-dimensional dataframe and shuffle the values within those columns. I can easily shuffle all values (column-wise) of the dataframe, but I want to do so only for a few randomly selected columns.
For instance, take the 6x6 dataframe below:
   0  1  2  3  4  5
0  5  3  7  1  2  9
1  1  7  5  3  0  8
2  0  2  7  1  6  5
3  8  4  2  1  9  7
4  2  9  5  6  3  4
5  7  5  8  2  1  0
After randomly selecting a few of the 6 columns and shuffling them, the output might look like this:
   0  1  2  3  4  5
0  2  9  7  1  2  4
1  5  7  5  3  0  0
2  7  2  7  1  6  5
3  8  3  2  1  9  7
4  1  5  5  6  3  9
5  0  4  8  2  1  8
The above shows the 1st, 2nd and last columns shuffled, while all others remain as they were.
The following code allows me to shuffle all columns:
import numpy as np
df = np.random.random((6,6))
np.random.shuffle(df)
And yet, after many attempts, I have been unable to modify this to shuffle only a few randomly selected columns. Any advice will be greatly appreciated. Thank you.
Assuming this input example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4*5).reshape(4, 5, order='F'))
   0  1   2   3   4
0  0  4   8  12  16
1  1  5   9  13  17
2  2  6  10  14  18
3  3  7  11  15  19
I would use:
import numpy as np
# random number of columns (at least 1, at most all of them)
n = np.random.randint(1, df.shape[1] + 1)
# pick n random columns
cols = np.random.choice(df.columns, n, replace=False)
# shuffle them independently
df[cols] = df[cols].apply(lambda s: np.random.choice(s, len(s), replace=False))
You can even vectorize the last step with Generator.permuted if efficiency is important:
rng = np.random.default_rng()
# n = rng.integers(0, df.shape[1])
# cols = rng.choice(df.columns, n, replace=False)
df[cols] = rng.permuted(df[cols], axis=0)
Example output:
   0  1   2   3   4
0  1  4  11  14  16
1  0  5   8  15  17
2  3  6  10  13  18
3  2  7   9  12  19
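If you want this as a reusable helper, the steps above could be wrapped roughly as follows. This is only a sketch: the function name shuffle_random_columns and its n and seed parameters are my own naming for illustration, not part of pandas or NumPy. It assumes the selected columns share a dtype (as in the integer example) and returns a copy instead of modifying the input.

```python
import numpy as np
import pandas as pd


def shuffle_random_columns(df, n=None, seed=None):
    """Return (copy of df with n random columns shuffled, the chosen column labels).

    Hypothetical helper for illustration; n and seed are assumed parameters.
    """
    rng = np.random.default_rng(seed)
    if n is None:
        # random number of columns: at least 1, at most all of them
        n = int(rng.integers(1, df.shape[1] + 1))
    # pick n distinct column positions, then map them to labels
    pos = rng.choice(df.shape[1], size=n, replace=False)
    cols = df.columns[pos]
    out = df.copy()
    # permuted(..., axis=0) shuffles each selected column independently
    out[cols] = rng.permuted(df[cols].to_numpy(), axis=0)
    return out, cols


df = pd.DataFrame(np.arange(4 * 5).reshape(4, 5, order="F"))
shuffled, cols = shuffle_random_columns(df, n=3, seed=0)
print(sorted(cols))
print(shuffled)
```

Passing a seed makes the selection and the shuffle reproducible, which is handy when you want to check that the untouched columns really stayed as they were.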