
pandas dataframe group with condition


I have a 3D dataframe with x, y, and time as the third dimension. The data are 5 indices derived from satellite images that were taken at different times. The x and y values identify each pixel.

 x        y              time       SIPI       classif
7.620001 -77.849990     2018-04-07  1.011107    2.0
                        2018-10-14  1.023407    2.0
                        2018-12-28  0.045107    3.0
                        2020-01-10  0.351107    2.0
                        2018-06-29  0.351107    2.0
         -77.849899     2018-04-07  1.010777    8.0
                        2018-10-14  0.510562    2.0
                        2018-12-28  1.410766    4.0
                        2020-01-10  1.010666    8.0
                        2018-06-29  2.057068    8.0
         -77.849809     2018-04-07  0.986991    1.0
                        2018-10-14  0.986991    8.0
                        2018-12-28  0.986991    5.0
                        2020-01-10  0.984791    5.0
                        2018-06-29  0.986991    3.0
         -77.849718     2018-04-07  0.975965    10.0
                        2018-10-14  0.964765    7.0
                        2018-12-28  0.975965    10.0
                        2020-01-10  0.975965    10.0
                        2018-06-29  0.975965    3.0
         -77.849627     2018-04-07  1.957747    2.0
                        2018-10-14  0.132445    6.0
                        2018-12-28  0.589677    2.0
                        2020-01-10  1.982445    2.0
                        2018-06-29  3.334456    7.0

I need to group the data by pixel and, as a new column, get the value from the 'classif' column that is most frequent across the 5 datasets. The values are integers between 1 and 10. I also want to add a condition so that a value is kept only if its frequency is at least 3.

 x          y           classif
7.620001 -77.849990     2.0
         -77.849899     8.0
         -77.849809     Na
         -77.849718     10.0
         -77.849627     2.0

So as a result I need a dataframe where each pixel has the value with the highest frequency, and where the frequency is lower than 3 there should be an NA value.

Can the pandas.groupby function do that? I thought about value_counts(), but I'm not sure how to apply it to my dataset.

Thank you in advance!


Solution

  • Here is a clunky way to do it:

    import numpy as np
    import pandas as pd

    # Get the mode per group and count how often it occurs
    # (if several values tie, mode() returns them sorted, so the smallest is used)
    df_modes = df.groupby(["x", "y"]).agg(
        {
            'classif': [lambda x: x.mode().iloc[0],
                        lambda x: (x == x.mode().iloc[0]).sum()]
        }
    ).reset_index()
    # Rename the columns to something a bit more readable
    df_modes.columns = ["x", "y", "classif_mode", "classif_mode_freq"]
    # Discard modes that occur fewer than 3 times
    df_modes.loc[df_modes["classif_mode_freq"] < 3, "classif_mode"] = np.nan
    

    Now df_modes.drop("classif_mode_freq", axis=1) will return

              x          y  classif_mode
    0  7.620001 -77.849990           2.0
    1  7.620001 -77.849899           8.0
    2  7.620001 -77.849809           NaN
    3  7.620001 -77.849718          10.0
    4  7.620001 -77.849627           2.0
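
Since the question asks about value_counts(), here is a minimal sketch of that route. It rebuilds the sample data from the question as a flat dataframe, leaving out the time and SIPI columns. The names MIN_FREQ and mode_if_frequent are made up for illustration. It assumes every group has at least one non-NaN value, and if several values tie for the highest count it may pick a different one than mode() would.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question: five acquisitions per pixel.
classif = [
    2, 2, 3, 2, 2,      # y = -77.849990
    8, 2, 4, 8, 8,      # y = -77.849899
    1, 8, 5, 5, 3,      # y = -77.849809
    10, 7, 10, 10, 3,   # y = -77.849718
    2, 6, 2, 2, 7,      # y = -77.849627
]
ys = [-77.849990, -77.849899, -77.849809, -77.849718, -77.849627]
df = pd.DataFrame({
    "x": 7.620001,                  # scalar is broadcast to every row
    "y": np.repeat(ys, 5),          # each y repeated for its five dates
    "classif": np.array(classif, dtype=float),
})

MIN_FREQ = 3  # hypothetical name for the threshold from the question


def mode_if_frequent(values: pd.Series, min_freq: int = MIN_FREQ) -> float:
    """Return the most frequent value, or NaN if it occurs fewer than min_freq times."""
    counts = values.value_counts()  # sorted by count, most frequent first
    return counts.index[0] if counts.iloc[0] >= min_freq else np.nan


result = (
    df.groupby(["x", "y"])["classif"]
      .agg(mode_if_frequent)
      .reset_index()
)
print(result)
```

For the sample data this gives 2.0, 8.0, NaN, 10.0 and 2.0 for the five pixels, which matches the expected table in the question. Computing value_counts() once per group avoids calling mode() twice, at the cost of less predictable tie-breaking.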