Search code examples
pythonpandasmin

Finding n lowest values for each row in a dataframe


I have a large datarframe with 1739 rows and 1455 columns. I want to find the 150 lowest values for each row (Not the the 150 th value but 150 values).

I iterate over rows with a basic for loop.

I tried df.min(axis=1) but it only gives out one min. And also the rolling_min function without success.

Is there any existing function where i can enter the number of values i want to find witn .min?

My ultimate goal is to take the 150 lowest values and create a slope then calculate the area under the curve. Do this for each row and add the areas to obtain a volume.

Example of the the dataframe, I have a df that looks like this:

     -218.7     -218.4    ...          217.2      217.5
0     56.632706  13.638315    ...      76.543000  76.543000
1     56.633455  13.576762    ...      76.543000  76.543000
2    -18.432203 -18.384091    ...      76.543000  76.543000
3    -18.476594 -18.439804    ...      76.543000  76.543000

The header is the '-218.7 ...' which are the coordinates in the x axis of a scan. The data is the height of the scan the y axis. What i need is the 150 lowest values for each rows and there associated column header as i want to make a curve for each row then calculate the area under the curve.

So i need for each line something like this :

         -218.7     -218.4    ... for 150 columns
4    -18.532035 -18.497517    ... for 150 values

I don't think i need to store the header info for each line, a for loop would go trough each row one at a time.


Solution

  • Use .argsort to get the indices of the underlying array sorted. Slice the values and the column Index to get all of the information you need. We'll create a MultiIndex so we can store both the column headers and values in the same DataFrame. The first level will be your nth lowest indicator.

    Example:

    import pandas as pd
    import numpy as np
    
    np.random.seed(1)
    df = pd.DataFrame(np.random.randint(1,100000, (1739, 26)))
    df.columns = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    
    N = 7  # 150 in your case
    idx = np.argsort(df.values, 1)[:, 0:N]
    
    pd.concat([pd.DataFrame(np.take_along_axis(df.to_numpy(), idx, axis=1), index=df.index),
               pd.DataFrame(df.columns.to_numpy(), index=df.index)],
               keys=['Value', 'Columns'], axis=1)
    

    Output:

          Value                                           Columns                  
              0      1      2      3      4      5      6       0  1  2  3  4  5  6
    0      5193   7752   8445  19947  20610  21441  21759       C  K  U  V  I  G  P
    1       432   3607  16278  17138  19434  26104  33879       R  J  W  C  B  D  G
    2        16   1047   1845   9553  12314  13784  19432       K  S  E  F  M  O  U
    3       244   5272  10836  13682  29237  33230  34448       K  Q  A  S  X  W  G
    4      9765  11275  13160  22808  30870  33484  42760       K  T  L  U  C  D  M
    5      2034   2179   4980   7184  14826  15238  22807       Z  H  F  Q  L  R  X
    ...