Search code examples
pythonpandasdataframefiltercase-folding

Filtering two columns of a dataframe with filter


I have a dataframe of the following type:

df = pd.DataFrame(
    {
        "Name": [
            [
                " Verbundmörtel ",
                " Compound Mortar ",
                " Malta per stucchi e per incollaggio ",
            ],
            [" StoLevell In Absolute ", " StoLevell In Absolute "],
            [
                " Anhydrit-FlieÃ\x9festrich ",
                " Anhydrite Flowing Screed ",
                " Massetto a base di anidrite ",
            ],
        ],
        "NAME_FILE": [
            "AdhesiveCoveringPlaster_2",
            "AdhesiveMortarLevellInForAEVERO_720",
            "AnhydriteFlowingScreed_20",
        ],
    }
)

I am trying to filter both columns according to certain keywords. The idea is to get a dataframe where I have the two columns, but with only the values that meet the condition.

select_materials = {
    "Plaster": list(
        filter(
            lambda x: "Hist".casefold() in x.casefold()
            or "Plaster".casefold() in x.casefold()
            or "Gips".casefold() in x.casefold(),
            df.FILE,
        )
    )
}

For the sake of convenience, I have only inserted a few lines from the entire file, but I hope my objective is clear. The result should then be a dataframe with the filters applied to both columns. If a filtered word is present in either column of the same row, then the whole row should be kept.


Solution

  • Rather than using filter, I would suggest a more idiomatic way to proceed.

    Suppose you want to filter on the word "Mortar":

    # Simply define two filtering masks, since one column contains lists of strings,
    # whereas the other one simply contains strings
    mask1 = df["Name"].apply(lambda x: "Mortar".casefold() in "".join(x).casefold())
    mask2 = df["NAME_FILE"].apply(lambda x: "Mortar".casefold() in x.casefold())
    
    # If a filtered word is present in either column of the same row,
    # then the whole row should be kept
    print(df.loc[mask1 | mask2, :])
    
                                                    Name                            NAME_FILE
    0  [ Verbundmörtel ,  Compound Mortar ,  Malta p...            AdhesiveCoveringPlaster_2
                                    ^^^^^^
    1  [ StoLevell In Absolute ,  StoLevell In Absolu...  AdhesiveMortarLevellInForAEVERO_720
                                                                  ^^^^^^