Search code examples
pythonpandaslistcounthashtag

Performatic way to count hashtags inside list (pandas)


I have a dataframe with ~7.000.000 rows and a lot of columns.

Each row is a Tweet, and i have a column text with tweet's content.

I created a new column just for hashtags inside text:

df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

So i have a column called hashtags with each row containing a list structure: ['#b747', '#test'].

I would like to count the number of each hashtag but i have a heavy number of rows. What is the most performatic way to do it?


Solution

  • Here are some different approaches, along with timing, ordered by speed (fastest first):

    # setup
    n = 10_000
    df = pd.DataFrame({
        'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
    })
    
    
    # 1. using itertools.chain to build an iterator on the elements of the lists
    from itertools import chain
    %timeit Counter(chain(*df.hashtags))
    # 7.35 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    # 2. as per @Psidom comment
    %timeit df.hashtags.explode().value_counts()
    # 8.06 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    # 3. using Counter constructor, but specifying an iterator, not a list
    %timeit Counter(h for hl in df.hashtags for h in hl)
    # 10.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    # 4. iterating explicitly and using Counter().update()
    def count5(s):
        c = Counter()
        for hl in s:
            c.update(hl)
        return c
    %timeit count5(df.hashtags)
    # 12.4 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    # 5. using itertools.reduce on Counter().update()
    %timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
    # 13.7 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    
    # 6. as per @EzerK
    %timeit Counter(sum(df['hashtags'].values, []))
    # 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Conclusion: the fastest is #1 (using Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom comment) is almost as fast. I would probably go with that. #6 (@EzerK approach) is very slow for large df slow because we are building a new (long) list before passing it as argument to Counter().