I have two strings that I want to word-tokenize and then compare for differences:
s1 = 'one two shmoo'
s2 = 'one one two'
My first thought was to turn them both into collections.Counter objects, wrap them in pd.Series, and subtract one from the other.
import pandas as pd
from collections import Counter
def counter_series(s):
    return pd.Series(Counter(s.split(' ')))
counter_series(s2) - counter_series(s1)
But the output shows that this difference doesn't provide a count for words that aren't present in both strings:
one 1.0
shmoo NaN
two 0.0
dtype: float64
How can I include the missing counts? E.g. in the output above, shmoo should also be 1. The solution doesn't have to use pandas.
Use sub with fill_value=0, which treats a word missing from one Series as a count of 0 before subtracting:
counter_series(s2).sub(counter_series(s1), fill_value=0)
Output:
one 1.0
shmoo -1.0
two 0.0
dtype: float64
You can also add .abs() to get the absolute value of the differences:
counter_series(s2).sub(counter_series(s1), fill_value=0).abs()
Output:
one 1.0
shmoo 1.0
two 0.0
dtype: float64
However, I would use value_counts instead of importing Counter from collections:
def count_series(x):
    s = pd.Series(x.split(' '))
    return s.value_counts()
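Since the question notes that the solution doesn't have to use pandas, here is a minimal standard-library sketch of the same idea. The helper name word_count_diff is hypothetical (it is not part of the question or of any library). It relies on Counter.subtract, which, unlike the - operator on Counters, keeps zero and negative counts.

```python
from collections import Counter

def word_count_diff(a, b):
    """Per-word counts of `a` minus counts of `b` (hypothetical helper)."""
    diff = Counter(a.split())
    # subtract() works in place and keeps zero and negative counts,
    # whereas `Counter - Counter` would drop them.
    diff.subtract(Counter(b.split()))
    return dict(diff)

s1 = 'one two shmoo'
s2 = 'one one two'

diff = word_count_diff(s2, s1)
# Absolute differences, analogous to calling .abs() on the Series.
abs_diff = {word: abs(n) for word, n in diff.items()}
```

Here diff holds the signed difference for every word that appears in either string, and abs_diff gives the unsigned version.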