Tags: python, python-2.7, pandas, comparison

Python: comparing two Counter objects with different keys


I have two strings that I want to tokenize into words and then compare for differences:

s1 = 'one two shmoo'
s2 = 'one one two'

My first thought was to turn them both into collections.Counter objects, wrap them in pd.Series, and subtract one from the other.

import pandas as pd
from collections import Counter
def counter_series(s):
    return pd.Series(Counter(s.split(' ')))

counter_series(s2) - counter_series(s1)

But the output shows that this difference doesn't provide a count for words that aren't present in both strings:

one      1.0
shmoo    NaN
two      0.0
dtype: float64

How can you include the missing counts? E.g. in the output above shmoo should also be 1. The solution doesn't have to use pandas.


Solution

  • Use Series.sub with fill_value=0, so that a word missing from one Series is treated as a count of 0 instead of producing NaN:

    counter_series(s2).sub(counter_series(s1), fill_value=0)
    

    Output:

    one      1.0
    shmoo   -1.0
    two      0.0
    dtype: float64
    

    You can also append .abs() to get the absolute value of each difference:

    counter_series(s2).sub(counter_series(s1), fill_value=0).abs()
    

    Output:

    one      1.0
    shmoo    1.0
    two      0.0
    dtype: float64
    

    However, I would use value_counts instead of importing Counter from collections.

    def count_series(x):
        # Split on spaces, then count how often each word occurs
        s = pd.Series(x.split(' '))
        return s.value_counts()
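
Since the question says the solution doesn't have to use pandas, here is a minimal sketch that uses only collections.Counter. The helper name word_count_diff is made up for illustration, and it assumes that splitting on any whitespace with split() is acceptable (the code above splits on single spaces). Counter.subtract keeps zero and negative counts, whereas the - operator between two Counters drops every result that is not positive, so subtract is the one that preserves the missing word.

```python
from collections import Counter


def word_count_diff(a, b):
    """Per-word counts of `a` minus per-word counts of `b` (hypothetical helper)."""
    diff = Counter(a.split())
    # subtract() updates in place and keeps zero and negative counts;
    # words that only appear in `b` end up with a negative count.
    diff.subtract(Counter(b.split()))
    return diff


s1 = 'one two shmoo'
s2 = 'one one two'

# Signed difference, same direction as counter_series(s2) - counter_series(s1)
diff = word_count_diff(s2, s1)
print(dict(diff))

# Absolute difference, if only the size of the mismatch matters
abs_diff = {word: abs(n) for word, n in diff.items()}
print(abs_diff)
```

Here shmoo comes out as -1 in the signed result and 1 in the absolute one, and the counts stay integers because no NaN is ever involved.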