Search code examples
pythonpandasdataframedatatablenumbers

Largest five numbers when there are three columns


a) How to find the largest five SNRs when the ID is repeated? And also I want all these three columns as the output. b) I also want the eliminated lines as the output.

         FIT             ID                   SNR
    1011563.fit,  J16142485-3141000 ,       36   
    1011729.fit,  J17210134-3757437 ,       18   
    1011730.fit,  J17210134-3757437 ,       20   
    1011731.fit,  J17210134-3757437 ,       20   
    1011732.fit,  J17210134-3757437 ,       13   
    1011914.fit,  J17210134-3757437 ,       38   
    1011915.fit,  J17210134-3757437 ,       26   
    1011916.fit,  J17210134-3757437 ,       19   
    1011917.fit,  J17210134-3757437 ,       47   
    1011918.fit,  J17210134-3757437 ,       25 ´´´   
  

  Expected output for a.

                   FITS                    ID  SNR
```8  1011917.fit    J17210134-3757437    47
   5  1011914.fit    J17210134-3757437    38
   0  1011563.fit    J16142485-3141000    36
   6  1011915.fit    J17210134-3757437    26
   9  1011918.fit    J17210134-3757437    25
   2  1011730.fit    J17210134-3757437    20 ´´´

Output b) 
 
```          FITS                    ID  SNR
     1  1011729.fit    J17210134-3757437    18
     6  1011915.fit    J17210134-3757437    26
     7  1011916.fit    J17210134-3757437    19
     8  1011917.fit    J17210134-3757437    47´´´
As you can see SNR "6  1011915.fit    J17210134-3757437    26" and 
                   "8  1011917.fit    J17210134-3757437    47" are repeated. But I want this only as output a and not b.

Solution

  • Use groupby+head on the sorted dataframe to get the indices, then slice:

    idx = df.sort_values(by='SNR', ascending=False).groupby('ID').head(5).index
    
    df2 = df.loc[idx]
    

    output:

                FITS                 ID  SNR
    0   1004234.fits  J16355032-2814188  714
    4   1004238.fits  J16355032-2814188  690
    11  1004245.fits  J16355032-2814188  645
    8   1004242.fits  J16355032-2814188  635
    9   1004243.fits  J16355032-2814188  522
    17  1005114.fits  J22154748+4954052  227
    16  1005113.fits  J22154748+4954052  212
    13  1004476.fits  J22152631+4958343  162
    19  1005116.fits  J22154748+4954052  160
    18  1005115.fits  J22154748+4954052  148
    15  1004478.fits  J22152631+4958343  103
    14  1004477.fits  J22152631+4958343   76
    12  1004475.fits  J22152631+4958343   62
    

    Other rows:

    df3 = df.loc[df.index.difference(idx)]
    

    output:

                FITS                 ID  SNR
    1   1004235.fits  J16355032-2814188  444
    2   1004236.fits  J16355032-2814188  331
    3   1004237.fits  J16355032-2814188  492
    5   1004239.fits  J16355032-2814188  491
    6   1004240.fits  J16355032-2814188  489
    7   1004241.fits  J16355032-2814188  382
    10  1004244.fits  J16355032-2814188  385