python pandas dataframe vectorization

How to vectorize operations in a pandas DataFrame?


import pandas as pd

columns = ['S1', 'S2', 'S3', 'S4', 'S5']

df = pd.DataFrame({'Patient':['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8', 'p8', 'p10'],
                   'S1':[0.7, 0.3, 0.5, 0.8, 0.9, 0.1, 0.9, 0.2, 0.6, 0.3],
                   'S2':[0.2, 0.3, 0.5, 0.4, 0.9, 0.1, 0.9, 0.7, 0.4, 0.3],
                   'S3':[0.6, 0.3, 0.5, 0.8, 0.9, 0.8, 0.9, 0.3, 0.6, 0.3],
                   'S4':[0.2, 0.3, 0.7, 0.8, 0.9, 0.1, 0.9, 0.7, 0.3, 0.3 ],
                   'S5':[0.9, 0.8, 0.5, 0.8, 0.9, 0.7, 0.2, 0.7, 0.6, 0.3 ]})

# vectorized operations in data frame

# count the cells that are >= 0.5 in each column
arr1 = df[columns].ge(0.5).sum().to_numpy()

# sum the cells that are >= 0.5 in each column
arr2 = df[df[columns]>=0.5][columns].sum().to_numpy()

print(arr1)
print(arr2)

For each column in the df, how do I get the list (or set) of patients whose value is >= 0.5, like below?

[('p1', 'p3', 'p4', 'p5', 'p7', 'p9'), 
 ('p3', 'p5', 'p7', 'p8'), 
 ('p1', 'p3', 'p4', 'p5', 'p6', 'p7', 'p9'), 
 (...),
 (...)]

Solution

  • The result is not in tabular format (each column yields a different number of patients), so you can just use a list comprehension in this case:

    [df.Patient[df[col] >= 0.5].to_list() for col in columns]
    
    #[['p1', 'p3', 'p4', 'p5', 'p7', 'p8'],
    # ['p3', 'p5', 'p7', 'p8'],
    # ['p1', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8'],
    # ['p3', 'p4', 'p5', 'p7', 'p8'],
    # ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p8', 'p8']]
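
If tuples or sets are wanted, as in the question, one possible variation is to build the boolean mask once and reuse it per column. This is only a sketch: the small frame below uses made-up stand-in values rather than the data from the question, and the names `mask`, `patients_per_column` and `patients_by_column` are chosen here for illustration.

```python
import pandas as pd

columns = ['S1', 'S2', 'S3']

# Small stand-in frame (hypothetical values, not the data from the question)
df = pd.DataFrame({'Patient': ['p1', 'p2', 'p3', 'p4'],
                   'S1': [0.7, 0.3, 0.5, 0.1],
                   'S2': [0.2, 0.9, 0.5, 0.4],
                   'S3': [0.1, 0.2, 0.3, 0.8]})

# Vectorized comparison: one boolean mask covering all score columns
mask = df[columns].ge(0.5)

# One tuple of patients per column, in the order of `columns`
patients_per_column = [tuple(df.loc[mask[col], 'Patient']) for col in columns]

# The same information as sets, keyed by column name
patients_by_column = {col: set(df.loc[mask[col], 'Patient']) for col in columns}

print(patients_per_column)
print(patients_by_column)
```

The comparison itself stays vectorized; only the final step of collecting ragged results goes through a Python loop, which is hard to avoid because the output lengths differ per column.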