Search code examples
pythonpandasdataframepython-re

How to remove string None from a single string with commas and count the most common words in a row?


I have rows in a df formed by a string that contains several elements separated by commas.

Among the rows, there are words of interest (eg. Car, Bus) and the word None. Also, there are rows that only have the word None.

Here is an example of df:

Col
Car, None, None, Car, Bus, None
None
Bus, Bus, None, Car, Car, None
None, None, None

Here is an example of the expected result:

Col Most common words
Car, Car, Bus Car (2)
Bus, Bus, Car, Car Bus (2), Car (2)

In short: I need to remove None from the rows with the words of interest; delete rows consisting only of None and finally I need to count the words of interest remaining in the rows?

Is there any way to do this in Python?


Solution

  • You can split the "Col" column with .str.split(", "), filter out the None values, empty lists and count unique items with .value_counts():

    df.Col = df.Col.str.split(", ").apply(lambda x: [v for v in x if v != "None"])
    df = df[df.Col.str.len() > 0]
    
    df["Most common words"] = df.Col.apply(
        lambda x: ", ".join(
            f"{a} ({b})" for a, b in pd.Series(x).value_counts().to_dict().items()
        )
    )
    df.Col = df.Col.apply(", ".join)
    print(df)
    

    Prints:

                      Col Most common words
    0       Car, Car, Bus  Car (2), Bus (1)
    2  Bus, Bus, Car, Car  Bus (2), Car (2)
    

    df used:

                                   Col
    0  Car, None, None, Car, Bus, None
    1                             None
    2   Bus, Bus, None, Car, Car, None
    3                 None, None, None