Search code examples
pandasgroup-by

Check all records in a pandas group specific column make up a predefined list of expected values


I have a dataframe whereby I need to check whether each group contains a full complement of excepted values.

I have tried grouping by Patient ID and using transform to check that each group has all 6 required value(s) Y in Inclusion being: -3, -2, -1, 1, 2, 3 otherwise marked as N

Using all in a lambda doesn't quite get me there

Original df

Patient ID Surg ID Surg Type Surg Date Medical Appt Date Inclusion
1 1 X 2022-09-03 2022-03-04 -3
1 1 X 2022-09-03 2022-05-04 -2
1 1 X 2022-09-03 2022-06-04 -1
1 1 X 2022-09-03 2022-11-04 1
1 1 X 2022-09-03 2022-11-29 2
1 1 X 2022-09-03 2023-01-02 3
2 1 X 2022-10-03 2022-09-01 -2
2 1 X 2022-10-03 2022-09-11 -1
2 1 X 2022-10-03 2022-10-12 1
2 1 X 2022-10-03 2022-10-29 2
2 1 X 2022-10-03 2022-12-12 3

Desired df

Patient ID Surg ID Surg Type Surg Date Medical Appt Date Inclusion IncludeYN
1 1 X 2022-09-03 2022-03-04 -3 Y
1 1 X 2022-09-03 2022-05-04 -2 Y
1 1 X 2022-09-03 2022-06-04 -1 Y
1 1 X 2022-09-03 2022-11-04 1 Y
1 1 X 2022-09-03 2022-11-29 2 Y
1 1 X 2022-09-03 2023-01-02 3 Y
2 1 X 2022-10-03 2022-09-01 -2 N
2 1 X 2022-10-03 2022-09-11 -1 N
2 1 X 2022-10-03 2022-10-12 1 N
2 1 X 2022-10-03 2022-10-29 2 N
2 1 X 2022-10-03 2022-12-12 3 N

Solution

  • If you necessarily have -1 before -2, 1 before 2, etc. (and no other values than those, like comments) then a simple groupby.transform with nunique and checking that you have all 6 values:

    df['IncludeYN'] = np.where(df.groupby(['Patient ID', 'Surg ID'])
                               ['Inclusion'].transform('nunique').eq(6),
                               'Y', 'N')
    

    If you want to compare to a specific set of values with set.issubset and ignore possible external values (like "Excluded" from your other question):

    target = {-3, -2, -1, 1, 2, 3}
    
    df['IncludeYN'] = np.where(df.groupby(['Patient ID', 'Surg ID'])
                               ['Inclusion'].transform(lambda x: target.issubset(x)),
                               'Y', 'N')
    

    Output:

        Patient ID  Surg ID Surg Type   Surg Date Medical Appt Date  Inclusion IncludeYN
    0            1        1         X  2022-09-03        2022-03-04         -3         Y
    1            1        1         X  2022-09-03        2022-05-04         -2         Y
    2            1        1         X  2022-09-03        2022-06-04         -1         Y
    3            1        1         X  2022-09-03        2022-11-04          1         Y
    4            1        1         X  2022-09-03        2022-11-29          2         Y
    5            1        1         X  2022-09-03        2023-01-02          3         Y
    6            2        1         X  2022-10-03        2022-09-01         -2         N
    7            2        1         X  2022-10-03        2022-09-11         -1         N
    8            2        1         X  2022-10-03        2022-10-12          1         N
    9            2        1         X  2022-10-03        2022-10-29          2         N
    10           2        1         X  2022-10-03        2022-12-12          3         N