Search code examples
pythonpandasdataframeloopsmin

Removing observation with min value in a column out of the dataframe one by one using loop in Python


I have a dataframe "data" that looks like this:

f1 f2 f3
11 34 a
14 10 a
20 12 a
15 19 b
19 29 b
29 30 b

I want to find the minimum value of f2 IF f3 is a. I don't want to find min value of f2 for f3 = a or f3 = b, just when f3 = a. And then I want to remove the observation that is associated with that min value in f2 in the dataframe. So I had this code:

a_part = data[data['f3'] == 'a'
min1 = a_part['f2'].min()
min1 = data['f2'] = min1
data_new_1 = pd.dataframe(data.loc[~min1])

Which works well. And now my dataframe looks like:

f1 f2 f3
11 34 a
20 12 a
15 19 b
19 29 b
29 30 b

However, I want to remove the min value of f2 when f3 = a and the associated observation one by one by using a loop, and having a new dataframe each time. So essentially data_new_2 looks like:

f1 f2 f3
11 34 a
15 19 b
19 29 b
29 30 b

Until there's only b left in f3. I tried to do a loop for it:

for i in range(1,6):
    IN = data_new_i[['f3'] == 'a']
    min1 = a_part['f2'].min() 
    min1 = data_new_i['f2'] == min1
    vars()[data_new_i++] = pd.DataFrame(data.loc[~min1])

And this doesn't work. I'm very unfamiliar with the way Python treats new dataframe name with loop index. I think I have to use dict to put new dataframe in, but I don't know how I can extract a column of a dataframe from a dict, and how I can save the new dataframe into the dict. Can someone help me please?


Solution

  • In my solution ouput is list of DataFrames.

    If there are always unique values in column f2 use loop by index values of sorted column and drop row by minimal value:

    out = []
    data1 = data.sort_values('f2')
    for i in data1.loc[data1['f3'] == 'a', 'f2'].index:
        data = data.drop(i)
        out.append(data)
    print (out)
    [   f1  f2 f3
    0  11  34  a
    2  20  12  a
    3  15  19  b
    4  19  29  b
    5  29  30  b,    f1  f2 f3
    0  11  34  a
    3  15  19  b
    4  19  29  b
    5  29  30  b,    f1  f2 f3
    3  15  19  b
    4  19  29  b
    5  29  30  b]
    

    If possible duplicates and need remove all dupes, like here in first loop all rows with f2=10 use:

    print (data)
       f1  f2 f3
    0  11  10  a
    1  14  10  a
    2  20  12  a
    3  15  19  b
    4  19  29  b
    5  29  30  b
    
    out = []
    data1 = data.sort_values('f2')
    for i, g in data1.groupby(data1.loc[data1['f3'] == 'a', 'f2']):
        data = data.drop(g.index)
        out.append(data)
    print (out)
    [   f1  f2 f3
    2  20  12  a
    3  15  19  b
    4  19  29  b
    5  29  30  b,    f1  f2 f3
    3  15  19  b
    4  19  29  b
    5  29  30  b]
    

    It is not recommended, but possible create DataFrames by groups:

    data1 = data.sort_values('f2')
    for j, (i, g) in enumerate(data1.groupby(data1.loc[data1['f3'] == 'a', 'f2']), 1):
        data = data.drop(g.index)
        globals()[f'data_new_{j}'] = data
    print (data_new_1)
       f1  f2 f3
    2  20  12  a
    3  15  19  b
    4  19  29  b
    5  29  30  b
    
    print (data_new_2)
       f1  f2 f3
    3  15  19  b
    4  19  29  b
    5  29  30  b