Search code examples
pythonpandasdataframecontains

How to add a specific word to a new column when it is a value in a list within a column


supposed my data set

name what
A    apple[red]
B    cucumber[green]
C    dog
C    orange
D    banana
D    monkey
E    cat
F    carrot
.
.

I want to create and specify a list, and if the column contains a value contained in that list, I want to make the specified value a new column.

list value

fruit = ['apple', 'banana', 'orange']
animal = ['dog', 'monkey', 'cat']
vegetable = ['cucumber', 'carrot']

result what I want

name what     class
A    apple    fruit
B    cucumber vegetable
C    dog      animal
C    orange   fruit
D    banana   fruit
D    monkey   animal
E    cat      animal
F    carrot   vegetable

List values and column values do not 'match' and must be contained.

thank you for reading.


Solution

  • Use Series.map with dictionary created from lists and swapped keys with values with flattening:

    fruit = ['apple', 'banana', 'orange']
    animal = ['dog', 'monkey', 'cat']
    vegetable = ['cucumber', 'carrot']
    
    d = {'fruit':fruit, 'animal':animal,'vegetable':vegetable}
    #http://stackoverflow.com/a/31674731/2901002
    d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
    

    Loop alternative of dictionary comprehension:

    d1 = {}
    for oldk, oldv in d.items():
        for k in oldv:
            d1[k] = oldk
    

    And then:

    df['class'] = df['what'].map(d1)
    #if need values before first [
    #df['class'] = df['what'].str.split('[').str[0].map(d1)
    print (df)
      name      what      class
    0    A     apple      fruit
    1    B  cucumber  vegetable
    2    C       dog     animal
    3    C    orange      fruit
    4    D    banana      fruit
    5    D    monkey     animal
    6    E       cat     animal
    7    F    carrot  vegetable
    

    EDIT: For match by subtrings you can loop by dictionary d, check matching by Series.str.contains for mask and set new values:

    d = {'fruit':fruit, 'animal':animal,'vegetable':vegetable}
    
    for k, v in d.items():
        mask = df['what'].str.contains('|'.join(v))
        df.loc[mask, 'class'] = k
    print (df)
      name             what      class
    0    A       apple[red]      fruit
    1    B  cucumber[green]  vegetable
    2    C              dog     animal
    3    C           orange      fruit
    4    D           banana      fruit
    5    D           monkey     animal
    6    E              cat     animal
    7    F           carrot  vegetable
    

    If possible multiple words is possible use words boundaries:

    for k, v in d.items():
        pat = '|'.join(r"\b{}\b".format(x) for x in v)
        df.loc[ df['what'].str.contains(pat), 'class'] = k
    print (df)
      name             what      class
    0    A       apple[red]      fruit
    1    B  cucumber[green]  vegetable
    2    C              dog     animal
    3    C           orange      fruit
    4    D           banana      fruit
    5    D           monkey     animal
    6    E              cat     animal
    7    F           carrot  vegetable