Search code examples
pythonpandasdictionarypandas-apply

Creating a new column from two columns using a dictionary in Pandas


I want to create a column based on a group and threshold for cutoff from another column for each group of the grouped column.

The dataframe is below:

     df_in -> 

       unique_id  myvalue identif
   0      CTA15   19.0     TOP
   1      CTA15   22.0     TOP
   2      CTA15   28.0     TOP
   3      CTA15   18.0     TOP
   4      CTA15   22.4     TOP
   5      AC007    2.0     TOP
   6      AC007    2.3    SDME
   7      AC007    2.0    SDME
   8      AC007    5.0    SDME
   9      AC007    3.0    SDME
   10     AC007   31.4    SDME
   11     AC007    4.4    SDME
   12      CGT6    9.7    BTME
   13      CGT6   44.5    BTME
   14      TVF5    6.7    BTME
   15      TVF5    9.1    BTME
   16      TVF5   10.0    BTME
   17      BGD1    1.0    BTME
   18      BGD1    1.6     NON
   19       GHB   51.0     NON
   20       GHB   54.0     NON
   21       GHB    4.7     NON

So I have created a dictionary based on each group of the 'identif' column as :

    md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}

So my goal is to create a new column, say 'chk', based on the following condition:

If the "identif" column matches the key in the dictionary "md" and the value for that key is >= than the corresponding value in the "myvalue" column then I will have 1, otherwise 0.

However, I am trying to find a good way using map/groupby/apply to create the new output data frame. I am now doing a very inefficient way ( which is taking considerable time on real data of million rows) using a function as follows:

 def myfilter(df, idCol, valCol, mydict):

   for index,row in df.iterrows():
      for key, value in mydict.items(): 
         if row[idCol] == key and row[valCol] >= value:
             df['chk'] = 1
         elif row[idCol] == key and row[valCol] < value:
             df['chk'] = 0

  return df

Getting the output via the following call:

     df_out = myfilter(df_in, 'identif', 'myvalue', md)

So my output will be like:

     df_out ->

         unique_id  myvalue identif  chk
    0      CTA15     19.0     TOP    0
    1      CTA15     22.0     TOP    1
    2      CTA15     28.0     TOP    1
    3      CTA15     18.0     TOP    0
    4      CTA15     22.4     TOP    1
    5      AC007      2.0     TOP    0
    6      AC007      2.3    SDME    0
    7      AC007      2.0    SDME    0
    8      AC007      5.0    SDME    0
    9      AC007      3.0    SDME    0
    10     AC007     31.4    SDME    1
    11     AC007      4.4    SDME    0
    12      CGT6      9.7    BTME    0
    13      CGT6     44.5    BTME    1
    14      TVF5      6.7    BTME    0
    15      TVF5      9.1    BTME    0
    16      TVF5     10.0    BTME    0
    17      BGD1      1.0    BTME    0
    18      BGD1      1.6     NON    0
    19       GHB     51.0     NON    1
    20       GHB     54.0     NON    1
    21       GHB      4.7     NON    0

This works but extremely inefficient and would like a much better way to do it.


Solution

  • This should be faster:

    def func(identif, value):
        if identif in md:
            if value >= md[identif]:
                return 1.0
            else:
                return 0.0
        else:
            return np.NaN
    
    df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
    

    The timing on this little example:

    CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
    Wall time: 1.66 ms
    

    Your version timing:

    CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
    Wall time: 8.79 ms
    

    Although on such a small example it's not conclusive.