Search code examples
pythonpandasdataframevectorization

Group by and update based on condition python pandas


Hello I have a pandas dataframe with an ID column and a value column.

I'd like to use vectorized code to "group" each ID and then run a function on each value of each row within the group. If every value for a specific ID returns False from the function I want to delete the ID and all corresponding rows from the original data frame.

def funct1(num):
    num = num**.5
    if num - int(num) == 0:
        return True
    else:
        return False

import pandas as pd
df = {
   'ID':['1','1','1','1','1','2','2','2','2','2','3','3','3','3','3'],
   'Percentage':[7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]}
df = pd.DataFrame(df)

The end result should be:

df = {
   'ID':['1','1','1','1','1','2','2','2','2','2'],
   'Percentage':[7,8,9,10,11,12,13,14,15,16]}
df = pd.DataFrame(df)

Here we see that each row of ID 3 is deleted because there is no value within the group that has a int as a square root.

I also want to maintain the index because in the real data the values are not in order like this example.

The answer should return a df not a groupby object.

Thank you for the help!

Best


Solution

  • You don't need to groupby:

    funct1 = lambda pct: pct.pow(0.5) - pct.pow(0.5).astype(int) == 0
    out = df[df['ID'].isin(df.loc[funct1(df['Percentage']), 'ID'])]
    
    >>> out
      ID  Percentage
    0  1           7
    1  1           8
    2  1           9
    3  1          10
    4  1          11
    5  2          12
    6  2          13
    7  2          14
    8  2          15
    9  2          16
    

    Performance

    # @BENY
    %timeit df.groupby('ID').filter(lambda x : any((x['Percentage']**0.5).map(float.is_integer)))
    1.08 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # @Corralien
    %timeit df[df['ID'].isin(df.loc[funct1(df['Percentage']), 'ID'])]
    651 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)