python pandas multiple-columns calculated-columns indicator

Pandas Create a column based on first occurrence of keyword strings from another column


I am not sure how challenging this is, but here is my problem:

I have a data frame like the one below:

  df-in

      ColA         ColB    ColC    ColD

      PA           1       2       c
      PB           3       3       d
      PA           5       11      x
      Ind1 Suma    20      14      z
      QA           3       7       a
      QB           3       7       b
      QC           5       8       c
      QD           5       12      c
      Ind2 Dat     202     3       y
      RA           12      1       a
      RB           13      1       v
      RC           14      1       q
      Ind3 CapT    120     3       t
      Other        10      4       x

What I need to do is create a column (say ColN) that holds only the indicator values (Ind1, Ind2, or Ind3, as they appear in ColA), with the following rule: ColN is Ind1 for every row up to and including the row where ColA contains Ind1, Ind2 for the rows after that up to and including the Ind2 row, and Ind3 for the rows after that up to and including the Ind3 row.

(Just to emphasize: Ind1, Ind2, and Ind3 could be different strings such as 'star', 'planet', 'moon', but whatever they are, they are fixed and will not change.)

So my output df will look like the following:

  df-out

      ColA         ColB    ColC    ColD    ColN

      PA           1       2       c       Ind1
      PB           3       3       d       Ind1
      PA           5       11      x       Ind1
      Ind1 Suma    20      14      z       Ind1
      QA           3       7       a       Ind2
      QB           3       7       b       Ind2
      QC           5       8       c       Ind2
      QD           5       12      c       Ind2
      Ind2 Dat     202     3       y       Ind2
      RA           12      1       a       Ind3
      RB           13      1       v       Ind3
      RC           14      1       q       Ind3
      Ind3 CapT    120     3       t       Ind3
      Other        10      4       x       Ind3

So how can I achieve that? I have tried iterating over ColA and putting it in a list, but nothing seems to work. Any help would be appreciated.


Solution

  • Example

    The answers from @RomanPerekhrest and @mozway are excellent.

    If you cannot solve your problem with their answers, a more accurate example in the question would have made it easier to help.

    So I will create a new example.

    import pandas as pd
    data1 = {'ColA': ['PA', 'PB', 'star Suma', 'QA', 'moon Dat', 'RA', 'planet CapT']}
    df = pd.DataFrame(data1)
    

    df

        ColA
    0   PA
    1   PB
    2   star Suma
    3   QA
    4   moon Dat
    5   RA
    6   planet CapT
    

    I did not create ColB and ColC because they do not seem necessary for this problem.

    Code

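    We can extract the keyword (star, moon, or planet) from ColA wherever one is present, then back-fill the gaps so that every row takes the next keyword that appears at or below it.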

    kw = ['star', 'moon', 'planet']
    # Extract the first keyword found in ColA (NaN if none), then back-fill upward
    df['ColN'] = df['ColA'].str.extract(r'({})'.format('|'.join(kw)), expand=False).bfill()
    

    df

        ColA          ColN
    0   PA            star
    1   PB            star
    2   star Suma     star
    3   QA            moon
    4   moon Dat      moon
    5   RA            planet
    6   planet CapT   planet
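
To see why this works, it can help to split the one-liner into its two steps. The sketch below is illustrative rather than part of the original answer: the names `pattern`, `extracted`, and `filled` are introduced here, and `re.escape` is an added precaution (an assumption, not something the answer requires) in case a keyword ever contains regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame({'ColA': ['PA', 'PB', 'star Suma', 'QA', 'moon Dat', 'RA', 'planet CapT']})
kw = ['star', 'moon', 'planet']

# Build a single capture group such as (star|moon|planet).
# re.escape is an assumption-level safeguard for keywords with special characters.
pattern = '({})'.format('|'.join(re.escape(k) for k in kw))

# Step 1: pull out the keyword where ColA contains one; other rows become NaN.
extracted = df['ColA'].str.extract(pattern, expand=False)

# Step 2: back-fill, so each NaN takes the next keyword found further down.
filled = extracted.bfill()

print(pd.DataFrame({'ColA': df['ColA'], 'extracted': extracted, 'filled': filled}))
```

Passing `expand=False` with a single capture group returns a Series rather than a one-column DataFrame, which keeps the later assignment to a column straightforward.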
    

    The problem has already been solved by other answers.

    However, I have modified the example and solution to help you apply it to your situation.
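
As a rough sketch of how the same idea might be applied to the data frame in the question, the snippet below rebuilds `df-in` by hand and uses `Ind1`, `Ind2`, `Ind3` as the keywords. The trailing `.ffill()` is an assumption inferred from the expected output, where the final `Other` row is labelled `Ind3`; `bfill()` alone would leave that row as NaN because no keyword follows it.

```python
import pandas as pd

# Data copied from df-in in the question.
df = pd.DataFrame({
    'ColA': ['PA', 'PB', 'PA', 'Ind1 Suma', 'QA', 'QB', 'QC', 'QD',
             'Ind2 Dat', 'RA', 'RB', 'RC', 'Ind3 CapT', 'Other'],
    'ColB': [1, 3, 5, 20, 3, 3, 5, 5, 202, 12, 13, 14, 120, 10],
    'ColC': [2, 3, 11, 14, 7, 7, 8, 12, 3, 1, 1, 1, 3, 4],
    'ColD': list('cdxzabccyavqtx'),
})

kw = ['Ind1', 'Ind2', 'Ind3']
pattern = '({})'.format('|'.join(kw))

df['ColN'] = (
    df['ColA']
    .str.extract(pattern, expand=False)  # keyword where present, NaN elsewhere
    .bfill()                             # rows above an indicator take that indicator
    .ffill()                             # assumption: rows after the last indicator reuse it
)

print(df)
```

If rows after the last indicator should stay empty instead, dropping the `.ffill()` call would leave them as NaN.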