python dictionary counter

Remove words that appear fewer than 2 times in the text from a Pandas Series


I am trying to remove, from each scalar value in a Pandas Series, all words that appear fewer than 2 times across the whole text. What is the best way to do it? Here is my failed attempt:


    from collections import Counter
    import pandas as pd
    df = pd.DataFrame({'text':["The quick brown fox", "jumped over the lazy dog","jumped over the lazy dog"]})
    d=''.join(df['text'][:])
    m=d.split()
    q=Counter(m)
    print (q)
    df['text'].str.split().map(lambda el: " ".join(Counter(el for el in q.elements() if q[el] >= 2)))


output:
    Counter({'over': 2, 'the': 2, 'lazy': 2, 'The': 1, 'quick': 1, 'brown': 1, 'foxjumped': 1, 'dogjumped': 1, 'dog': 1})
    0    over the lazy
    1    over the lazy
    2    over the lazy
    Name: text, dtype: object


Solution

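  • from collections import Counter
    import pandas as pd

    df = pd.DataFrame({'text':["The quick brown fox", "jumped over the lazy dog","jumped over the lazy dog"]})
    # Count every word across all rows; explode() flattens the per-row word lists
    c = Counter(df.text.str.split().explode())
    # Keep only the words whose overall count is at least 2
    print( df.text.apply(lambda x: ' '.join(w for w in x.split() if c[w] >= 2)) )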
    

    Prints:

    0                            
    1    jumped over the lazy dog
    2    jumped over the lazy dog
    Name: text, dtype: object
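
The same idea can be wrapped in a small helper when the threshold needs to vary. This is only a minimal sketch, not part of the answer above: the function name `drop_rare_words` and the `min_count` parameter are hypothetical. It assumes the Series has no missing values, that words are separated by whitespace, and that matching is case-sensitive (so "The" and "the" are counted separately, as in the output above).

```python
from collections import Counter

import pandas as pd


def drop_rare_words(series: pd.Series, min_count: int = 2) -> pd.Series:
    """Remove words occurring fewer than `min_count` times across the whole Series.

    Hypothetical helper: whitespace splitting, case-sensitive, no NaN handling.
    """
    # Split each row once and reuse the word lists for counting and filtering.
    words = series.str.split()
    # explode() flattens the per-row lists into one long Series of words.
    counts = Counter(words.explode())
    return words.map(lambda row: " ".join(w for w in row if counts[w] >= min_count))


df = pd.DataFrame({"text": ["The quick brown fox",
                            "jumped over the lazy dog",
                            "jumped over the lazy dog"]})
result = drop_rare_words(df["text"])
print(result.tolist())
```

Counting on the exploded word lists, rather than on one joined string, keeps the last word of one row from fusing with the first word of the next, which is where the 'foxjumped' and 'dogjumped' entries in the question's counter come from.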