I have rows in a df formed by a string that contains several elements separated by commas.
Among the rows, there are words of interest (eg. Car, Bus) and the word None. Also, there are rows that only have the word None.
Here is an example of df:
Col |
---|
Car, None, None, Car, Bus, None |
None |
Bus, Bus, None, Car, Car, None |
None, None, None |
Here is an example of the expected result:
Col | Most common words |
---|---|
Car, Car, Bus | Car (2) |
Bus, Bus, Car, Car | Bus (2), Car (2) |
In short: I need to remove None from the rows with the words of interest; delete rows consisting only of None and finally I need to count the words of interest remaining in the rows?
Is there any way to do this in Python?
You can split the "Col" column with .str.split(", ")
, filter out the None
values, empty lists and count unique items with .value_counts()
:
df.Col = df.Col.str.split(", ").apply(lambda x: [v for v in x if v != "None"])
df = df[df.Col.str.len() > 0]
df["Most common words"] = df.Col.apply(
lambda x: ", ".join(
f"{a} ({b})" for a, b in pd.Series(x).value_counts().to_dict().items()
)
)
df.Col = df.Col.apply(", ".join)
print(df)
Prints:
Col Most common words
0 Car, Car, Bus Car (2), Bus (1)
2 Bus, Bus, Car, Car Bus (2), Car (2)
df
used:
Col
0 Car, None, None, Car, Bus, None
1 None
2 Bus, Bus, None, Car, Car, None
3 None, None, None