
How can I classify a column of strings as True or False by comparing it with another column of strings?

So I have a column of strings that is listed as "compounds"

Composition (column title)




I have another column that contains strings of metal elements from the periodic table, and I'll call that column "metals"

Elements (column title)




The objective is to check each string in "compounds" against every string listed in "metals"; if any string from "metals" appears in it, the compound should be classified as True. Any ideas how I can code this?

Example: (if "metals" has Zr, Ag, and Te)

ZrMo3 True

Gd(CuS)3 False

Ba2DyInTe5 True

I recently tried the code below, but I ended up getting all False:

asd = subset['composition'].isin(metals['Elements'])

I also tried this code and got all False as well:

subset['Boolean'] = subset.apply(lambda x: True if any(word in x.composition for word in metals) else False, axis=1)
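
Both attempts return all False for reasons that a small sketch can make visible. The `subset` and `metals` frames below are stand-ins built from the example values above, not the real data, so treat this as an illustration under that assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `subset` and `metals` frames.
subset = pd.DataFrame({"composition": ["ZrMo3", "Gd(CuS)3", "Ba2DyInTe5"]})
metals = pd.DataFrame({"Elements": ["Zr", "Ag", "Te"]})

# isin() tests whole-value equality, so "ZrMo3" never equals "Zr".
exact_match = subset["composition"].isin(metals["Elements"]).tolist()

# Iterating over a DataFrame yields its column labels, not its values,
# so the lambda was searching for the text "Elements" in each compound.
iterated = list(metals)

# Iterating over the column itself gives the substring test that was intended.
substring_match = [
    any(el in comp for el in metals["Elements"])
    for comp in subset["composition"]
]

print(exact_match)      # [False, False, False]
print(iterated)         # ['Elements']
print(substring_match)  # [True, False, True]
```

Note that a plain substring test can still over-match short symbols: for example, "S" is a substring of "Si".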


  • Assuming you are using pandas, you can use a list comprehension inside your lambda, since you essentially need to iterate over every entry in the elements list:

    import pandas as pd
    elements = ['Li', 'Be', 'Na', 'Te']
    compounds = ['ZrMo3', 'Gd(CuS)3', 'Ba2DyInTe5']
    df = pd.DataFrame(compounds, columns=['compounds'])


        compounds
    0       ZrMo3
    1    Gd(CuS)3
    2  Ba2DyInTe5

    # flag each compound that contains any of the element symbols as a substring
    df['boolean'] = df.compounds.apply(lambda x: any([el in x for el in elements]))


        compounds  boolean
    0       ZrMo3    False
    1    Gd(CuS)3    False
    2  Ba2DyInTe5     True

    If you are not using pandas, you can apply the same lambda to the list with the built-in map function:

    out = list(map(
        lambda x: any([el in x for el in elements]), compounds))


    [False, False, True]

    Here is a more complex version, based on the regular expression module re, which also tackles the potential errors @Ezon mentioned. Since this approach loops not only over the elements to compare against a single compound string but also over each constituent of the compound, I wrote two helper functions to keep it readable.

    import re
    import pandas as pd
    def split_compounds(c):
        # remove all non-alphabet elements
        c_split = re.sub(r"[^a-zA-Z]", "", c)
        # split string at capital letters
        c_split = '-'.join(re.findall('[A-Z][^A-Z]*', c_split))
        return c_split
    def compare_compound(compound, element):
        # split compound into list
        compound_list = compound.split('-')
        return any([element == c for c in compound_list])
    # build sample data
    compounds = ['SiO2', 'Ba2DyInTe5', 'ZrMo3', 'Gd(CuS)3']
    elements = ['Li', 'Be', 'Na', 'Te', 'S']
    df = pd.DataFrame(compounds, columns=['compounds'])
    # split compounds into elements
    df['compounds_elements'] = [split_compounds(x) for x in compounds]


        compounds compounds_elements
    0        SiO2               Si-O
    1  Ba2DyInTe5        Ba-Dy-In-Te
    2       ZrMo3              Zr-Mo
    3    Gd(CuS)3            Gd-Cu-S

    # check if any item from 'elements' is in the compounds
    df['boolean'] = df.compounds_elements.apply(
        lambda x: any([compare_compound(x, el) for el in elements]))


        compounds compounds_elements  boolean
    0        SiO2               Si-O    False
    1  Ba2DyInTe5        Ba-Dy-In-Te     True
    2       ZrMo3              Zr-Mo    False
    3    Gd(CuS)3            Gd-Cu-S     True
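
The same split-then-compare idea can be written more compactly with sets. This is only a sketch: the `contains_any` helper and the `SYMBOL` pattern are hypothetical names introduced here, and the pattern assumes every element symbol is one uppercase letter followed by optional lowercase letters.

```python
import re
import pandas as pd

# Assumption: a symbol is one uppercase letter plus optional lowercase letters.
SYMBOL = re.compile(r"[A-Z][a-z]*")

def contains_any(formula, wanted):
    """Return True if the formula shares at least one element symbol with `wanted`."""
    # findall() tokenizes e.g. "Gd(CuS)3" into ['Gd', 'Cu', 'S'];
    # digits and brackets are skipped because they never match the pattern.
    return not wanted.isdisjoint(SYMBOL.findall(formula))

# Same sample data as above.
elements = {"Li", "Be", "Na", "Te", "S"}
df = pd.DataFrame({"compounds": ["SiO2", "Ba2DyInTe5", "ZrMo3", "Gd(CuS)3"]})

# Compare whole symbols, so "S" does not match the "Si" in "SiO2".
df["boolean"] = df["compounds"].apply(lambda f: contains_any(f, elements))
print(df["boolean"].tolist())  # [False, True, False, True]
```

Holding the elements in a set means each compound is tokenized once and compared in a single step, instead of looping over every element for every compound.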