While analyzing some code, I stumbled upon the following snippet:
msk = np.random.rand(len(df)) < 0.8
The variables "msk" and "df" are irrelevant to my question. After doing some research, I think this usage is related to the "random" module. It gives True with an 80% chance and False with a 20% chance for each element, and it is used for masking. I understand why it is used, but I don't understand how it works. Isn't the random method supposed to return floats? Why do we get booleans when we compare the method's result against a threshold?
np.random.rand(len(df))
returns an array of uniform random floats in the half-open interval [0, 1). The comparison
np.random.rand(len(df)) < 0.8
is evaluated elementwise, so it turns that array into an array of booleans: True where the value is below 0.8, False otherwise.
Since each value has an 80% chance of being below 0.8, about 80% of the resulting values are True on average.
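To see the two steps separately, here is a minimal sketch. It assumes a seeded generator from np.random.default_rng (not in the original snippet, used only so the run is reproducible) and an arbitrary array size of 10,000:

```python
import numpy as np

# Seeded generator: an assumption for reproducibility, not part of the original snippet.
rng = np.random.default_rng(0)

values = rng.random(10_000)   # uniform floats in [0, 1)
mask = values < 0.8           # elementwise comparison yields a boolean array

print(values.dtype, mask.dtype)  # float64 bool
print(mask.mean())               # fraction of True values, close to 0.8
```

The random call still returns floats; it is the comparison operator that produces the booleans, one per element.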
A more explicit approach would be to use numpy.random.choice:
np.random.choice([True, False], p=[0.8, 0.2], size=len(df))
An even better approach, if your goal is to subset a dataframe, would be to use:
df.sample(frac=0.8)
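To keep both parts of the split:
df1 = df.sample(frac=0.8)  # random 80% of the rows
df2 = df.drop(df1.index)   # the remaining rows (assumes a unique index)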
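For comparison, here is a rough sketch of both ways to split a DataFrame. The toy DataFrame, the seeds, and the names train and test are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real df.
df = pd.DataFrame({"x": range(1000)})

# Mask-based split: each row is kept independently with probability 0.8,
# so the split is only approximately 80/20.
rng = np.random.default_rng(42)
msk = rng.random(len(df)) < 0.8
train = df[msk]    # rows where the mask is True
test = df[~msk]    # rows where the mask is False

# sample-based split: takes 80% of the rows (rounded), here exactly 800.
df1 = df.sample(frac=0.8, random_state=42)
df2 = df.drop(df1.index)

print(len(train), len(test))
print(len(df1), len(df2))
```

The mask approach gives a split whose size varies from run to run, while df.sample fixes the number of rows, which may matter if you need a precise 80/20 ratio.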