Tags: python, pandas, dataframe, random, partition

Create random partition inside a pandas dataframe and create a field that identifies partitions


I have created the following pandas dataframe:

import pandas as pd

ds = {'col1':[1.0,2.1,2.2,3.1,41,5.2,5.0,6.1,7.1,10]}
df = pd.DataFrame(data=ds)

The dataframe looks like this:

print(df)

   col1
0   1.0
1   2.1
2   2.2
3   3.1
4  41.0
5   5.2
6   5.0
7   6.1
8   7.1
9  10.0

I need to create a random 80% / 20% partition of the dataset and I also need to create a field (called buildFlag) which shows whether a record belongs to the 80% partition (buildFlag = 1) or belongs to the 20% partition (buildFlag = 0).

For example, the resulting dataframe would look like:

   col1  buildFlag
0   1.0          1
1   2.1          1
2   2.2          1
3   3.1          0
4  41.0          1
5   5.2          0
6   5.0          1
7   6.1          1
8   7.1          1
9  10.0          1

The buildFlag values are assigned randomly.

Can anyone help me, please?


Solution

    SOLUTION (PANDAS + NUMPY)

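    A possible solution, which works in three steps:

    • First, np.random.choice randomly picks 80% of df's index labels without replacement (replace=False), so no row is picked twice.

    • Then, df.index.isin checks each row's index label against the picked ones, giving a boolean array.

    • Finally, np.where assigns 1 to the Flag column for the picked rows and 0 for the others. Pass buildFlag=... instead of Flag=... to get the column name used in the question.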

    import numpy as np

    # Pick 80% of the index labels at random, then flag those rows with 1.
    df.assign(Flag=np.where(
        df.index.isin(np.random.choice(
            df.index, size=int(0.8 * len(df)),
            replace=False)),
        1, 0))
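
Since np.random.choice draws a fresh sample on every call, the flagged rows change from run to run. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The seed value 0, the names rng, build_idx and result, and the use of the question's column name buildFlag are illustrative choices, not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.1, 2.2, 3.1, 41, 5.2, 5.0, 6.1, 7.1, 10]})

# Seeded generator so the same rows are picked on every run (seed is arbitrary).
rng = np.random.default_rng(0)

# Draw 80% of the index labels without replacement.
build_idx = rng.choice(df.index.to_numpy(), size=int(0.8 * len(df)), replace=False)

# 1 for rows in the 80% partition, 0 for the remaining 20%.
result = df.assign(buildFlag=df.index.isin(build_idx).astype(int))

# Count of rows per partition.
print(result['buildFlag'].value_counts())
```

Note that assign returns a new dataframe and leaves df unchanged, so the result has to be stored (here in result) or assigned back to df.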
    

    SOLUTION (PANDAS + SKLEARN)

    Alternatively, we can use scikit-learn's train_test_split function:

    • First, it randomly splits the df's indices into two groups: 80% for training and 20% for testing, as specified by test_size=0.2.

    • The training indices are extracted using [0]. The df.index.isin method then checks which indices belong to the training set, producing a boolean array.

    • Finally, this boolean array is converted to integers (1 for True and 0 for False) using .astype(int).

    from sklearn.model_selection import train_test_split
    
    df.assign(Flag = df.index.isin(
        train_test_split(df.index, test_size=0.2, random_state=42)[0]).astype(int))
    

    Output (one possible result; which rows get Flag = 0 depends on the random draw):

       col1  Flag
    0   1.0     0
    1   2.1     1
    2   2.2     1
    3   3.1     1
    4  41.0     1
    5   5.2     1
    6   5.0     1
    7   6.1     1
    8   7.1     1
    9  10.0     0
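
For comparison, the same flag can probably be built with pandas alone. The sketch below is not from the answer above: it assumes DataFrame.sample with frac=0.8 is an acceptable way to draw the 80% partition, and the names build_rows and flagged, as well as random_state=42, are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.1, 2.2, 3.1, 41, 5.2, 5.0, 6.1, 7.1, 10]})

# Sample 80% of the rows without replacement; random_state fixes the draw
# (the value 42 is arbitrary).
build_rows = df.sample(frac=0.8, random_state=42)

# Rows whose index label was sampled get 1, all others get 0.
flagged = df.assign(buildFlag=df.index.isin(build_rows.index).astype(int))

print(flagged)
```

With 10 rows this flags 8 rows with 1 and 2 rows with 0. Omitting random_state gives a different partition on each run.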