Tags: pandas, dataframe, group-by

Check N columns where the same condition is met between consecutive dates and return the number and IDs of the matching columns for each group


          Date   MAC     SENS_1   SENS_2     SENS_3      SENS_4     SENS_N
0   2023-06-08  MAC1  79.608346  64.2030  66.181849  149.590573  12.135441
1   2023-06-09  MAC1  80.655564  64.6430  67.636943  109.034809  12.278676
2   2023-06-08  MAC2   5.323341   4.7814  57.231430   51.422297   0.494853
3   2023-06-09  MAC2   5.375561   4.7788  63.701169   88.638871   0.471265
4   2023-06-08  MAC3   3.221366   3.0242  61.071023  113.812544   0.251471
5   2023-06-09  MAC3   3.277577   3.0300  67.280632  133.127433   0.274765
6   2023-06-08  MAC4   3.029806   2.6720  63.310941   72.074345   0.393294
7   2023-06-09  MAC4   3.129871   2.6850  72.727142  100.133454   0.408941
8   2023-06-08  MAC5  18.987204  18.1676  57.660920   98.760819   1.219735
9   2023-06-09  MAC5  19.158136  18.1766  61.598102   93.546477   1.270471
10  2023-06-08  MACN  38.190817  35.2416  68.589171  160.597307   2.531353
11  2023-06-09  MACN  38.393878  34.3196  67.710298  139.854140   2.675000

For this dataframe I would like to know, for each MACHINE, how many SENSORS create a HIGH-LOW situation for the latest date.

The SENS_1 value of MAC1 for today (2023-06-09) is greater than yesterday's (2023-06-08) value, and the same holds for SENS_2, SENS_3 and SENS_N.

So what I need to return is a dataframe listing each MACHINE with the names and the number of SENSORS that meet the condition:

| MACHINE | SENSORS | N. of SENSORS |
|:---- |:------:| -----:|
| MAC1  | SENS_1,SENS_2,SENS_3,SENS_N    | 4 |
| MAC2  | SENS_1,SENS_3,SENS_4    | 3 |
| MAC3  | SENS_1,SENS_2,SENS_3,SENS_4,SENS_N    | 5 |
| MAC4 | SENS_1,SENS_2,SENS_3,SENS_4,SENS_N    | 5 |
| MAC5 | SENS_1,SENS_2,SENS_3,SENS_N    | 4 |
| MACN | SENS_1,SENS_N    | 2 |

I can use:

high_lows = df.loc[(df['MAC'] == df['MAC'].shift(-1)) & (df['SENS_1'] < df['SENS_1'].shift(-1)), 'MAC']

to identify SENSORS with high-lows one by one. However there are almost 500 machines and 150 sensors.

So I think I need to iterate over the columns, selecting the SENSOR columns after the MAC column:

df.iloc[:,2:]

and check for the condition.
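For reference, that per-column loop might look like the minimal sketch below (the two-machine sample data is illustrative, not the real 500×150 frame); it gives the right answer but makes a Python-level pass over every column of every group, which is what the vectorised approaches avoid:

```python
import pandas as pd

# Illustrative data: two machines, two consecutive dates each
df = pd.DataFrame({
    'Date': ['2023-06-08', '2023-06-09'] * 2,
    'MAC':  ['MAC1', 'MAC1', 'MAC2', 'MAC2'],
    'SENS_1': [79.6, 80.7, 5.32, 5.38],
    'SENS_2': [64.2, 64.6, 4.7814, 4.7788],
})

results = {}
for mac, g in df.groupby('MAC'):
    rising = []
    for col in g.columns[2:]:                 # every SENS_* column after Date/MAC
        if g[col].iloc[1] > g[col].iloc[0]:   # today's value vs yesterday's
            rising.append(col)
    results[mac] = rising
print(results)  # {'MAC1': ['SENS_1', 'SENS_2'], 'MAC2': ['SENS_1']}
```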

However, this function will run every 2000 ms (2 s), so it has to be done in the fastest way possible.

What is the most efficient way of iterating over N columns for N groups and extracting the groups that match the condition?

Example data with NaNs:

           Date    MAC     SENS_1   SENS_2     SENS_3      SENS_4     SENS_5
82   2023-06-08  MAC41  79.608346      NaN  66.181849  149.590573  12.135441
83   2023-06-09  MAC41  80.655564      NaN  67.636943  109.034809  12.278676

Solution

  • You can use:

    def count(df):
        # Per sensor column: take the diff between consecutive dates, fill the
        # first-row NaN with True so it never blocks the test, then require
        # every diff to be positive (NaN-only sensors also pass).
        m = df.iloc[:, 2:].diff().fillna(True).gt(0).all(axis=0)
        return pd.Series({'Sensors': ', '.join(m.index[m]), 'Count': m.sum()})
    
    out = df.groupby('MAC', as_index=False).apply(count)
    

    Output:

    >>> out
        MAC                                 Sensors  Count
    0  MAC1          SENS_1, SENS_2, SENS_3, SENS_N      4
    1  MAC2                  SENS_1, SENS_3, SENS_4      3
    2  MAC3  SENS_1, SENS_2, SENS_3, SENS_4, SENS_N      5
    3  MAC4  SENS_1, SENS_2, SENS_3, SENS_4, SENS_N      5
    4  MAC5          SENS_1, SENS_2, SENS_3, SENS_N      4
    5  MACN                          SENS_1, SENS_N      2
    
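Regarding the NaN example from the question: because of the `fillna(True)`, a sensor that is NaN on both dates is still counted as rising. The small demonstration below (using a cut-down version of the asker's MAC41 rows) shows this behaviour; drop the `fillna` and handle the first diff row differently if you would rather exclude such sensors:

```python
import pandas as pd
import numpy as np

# Cut-down version of the asker's NaN example: SENS_2 is NaN on both dates
g = pd.DataFrame({
    'Date': ['2023-06-08', '2023-06-09'],
    'MAC': ['MAC41', 'MAC41'],
    'SENS_1': [79.608346, 80.655564],
    'SENS_2': [np.nan, np.nan],
    'SENS_3': [66.181849, 67.636943],
})
# diff() yields NaN for the all-NaN sensor; fillna(True) turns those NaNs
# into truthy values, so SENS_2 passes the gt(0) test alongside the others.
m = g.iloc[:, 2:].diff().fillna(True).gt(0).all(axis=0)
print(list(m.index[m]))  # ['SENS_1', 'SENS_2', 'SENS_3']
```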

    Alternative (as you have only today and yesterday rows per machine):

    def count(df):
        # Assumes exactly two rows per machine: row 0 = yesterday, row 1 = today
        m = df.iloc[1, 2:] - df.iloc[0, 2:] > 0
        return pd.Series({'Sensors': ', '.join(m.index[m]), 'Count': m.sum()})
    
    out = df.groupby('MAC', as_index=False).apply(count)
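Given the 2-second cadence, it may be worth going one step further and dropping `groupby.apply` entirely. The sketch below is my own variant, not part of the original answer: it assumes exactly two rows per machine (yesterday then today, as in the question), sorts once, and compares the two halves as plain NumPy arrays, so there is no per-group Python function call. The sample frame is illustrative only:

```python
import pandas as pd
import numpy as np

# Illustrative data: two machines, two consecutive dates each
df = pd.DataFrame({
    'Date': ['2023-06-08', '2023-06-09'] * 2,
    'MAC':  ['MAC1', 'MAC1', 'MAC2', 'MAC2'],
    'SENS_1': [79.6, 80.7, 5.32, 5.38],
    'SENS_2': [64.2, 64.6, 4.7814, 4.7788],
})

# Sort so each machine's yesterday/today rows are adjacent
df = df.sort_values(['MAC', 'Date'], kind='stable')
sens = df.columns[2:]
vals = df[sens].to_numpy()
yesterday, today = vals[0::2], vals[1::2]
# NaN comparisons are False, so ~(today <= yesterday) counts NaN pairs as
# rising, matching the fillna(True) behaviour of the answer above
mask = ~(today <= yesterday)

out = pd.DataFrame({
    'MAC': df['MAC'].to_numpy()[0::2],
    'Sensors': [', '.join(sens[row]) for row in mask],
    'Count': mask.sum(axis=1),
})
print(out)
```

Whether this beats the `groupby` versions depends on the data layout, so it is worth timing both on the real 500-machine frame before committing to one.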