Search code examples
pythonpandasdataframeconditional-statements

Minimum Row Count condition based on values of all columns in pandas DataFrame


Suppose I have three columns in a pandas DataFrame without any null or empty values.

  • Facility always has unique values per item.
  • An item can have one or more vendors associated with it.
  • The same vendor can show up more than once for different facilities for a given item.
  • A facility is never associated with more than one vendor for a given item.
  • Item values are ordered.
  • Every row is unique.

Example of the data:

Column Vendor Column Item Column Facility
V1 I1 F1
V1 I1 F2
V2 I1 F4
V1 I2 F1
V1 I2 F2
V2 I2 F3
V3 I2 F5
V3 I2 F6
V3 I3 F3
V1 I4 F2
V4 I4 F4
V4 I4 F5
V1 I5 F1
V1 I5 F4

How do I create a DataFrame that contains the following?: "For every item, if there is more than one vendor associated with that item, give me all rows for that item (otherwise exclude all rows associated with that item)."

In this example, I would want a DataFrame of all rows for I1, I2, and I4 because they have more than 1 unique vendor associated with them, and I would exclude all rows involving I3 and I5 as there is only one vendor associated with them.

I tried to create a list of unique vendors for each item loop with a conditional of len(Appended_Vendor_List_var) > 1, but so far no luck.


Solution

  • You can use a groupby.transform('nunique') and boolean indexing:

    out = df[df.groupby('Column Item')['Column Vendor'].transform('nunique').gt(1)]
    

    Output:

       Column Vendor Column Item Column Facility
    0             V1          I1              F1
    1             V1          I1              F2
    2             V2          I1              F4
    3             V1          I2              F1
    4             V1          I2              F2
    5             V2          I2              F3
    6             V3          I2              F5
    7             V3          I2              F6
    9             V1          I4              F2
    10            V4          I4              F4
    11            V4          I4              F5
    

    Intermediate:

       Column Vendor Column Item Column Facility  nunique  nunique>1
    0             V1          I1              F1        2       True
    1             V1          I1              F2        2       True
    2             V2          I1              F4        2       True
    3             V1          I2              F1        3       True
    4             V1          I2              F2        3       True
    5             V2          I2              F3        3       True
    6             V3          I2              F5        3       True
    7             V3          I2              F6        3       True
    8             V3          I3              F3        1      False
    9             V1          I4              F2        2       True
    10            V4          I4              F4        2       True
    11            V4          I4              F5        2       True
    12            V1          I5              F1        1      False
    13            V1          I5              F4        1      False
    

    Variant using drop_duplicates and isin:

    drop = (df.drop_duplicates(['Column Vendor', 'Column Item'])['Column Item']
              .drop_duplicates(keep=False)
           )
    
    out = df[~df['Column Item'].isin(drop)]