python pandas dataframe text extract

What is the difference between pandas str.extractall() and pandas str.extract()?


I am trying to find every word in a column of strings that matches a given word list. With pandas str.extract() I only get the first matched word. Since I need all the matched words, I thought pandas str.extractall() would work, but it only gives me a ValueError.

What is the problem here?

 df['findWord'] = df['text'].str.extractall(f"({'|'.join(wordlist)})").fillna('')
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

Solution

  • extract returns the first match. extractall generates one row per match.

    For example, let's match A and the letter that follows it.

    df = pd.DataFrame({'col': ['ABC', 'ADAE']})
    #     col
    # 0   ABC
    # 1  ADAE
    
    df['col'].str.extractall('(A.)')
    

    This creates a new index level named "match" that numbers the matches within each row. Matches from the same row share the same value in the first index level.

    Output:

              0
      match    
    0 0      AB
    1 0      AD
      1      AE
    

    With extract:

    df['col'].str.extract('(A.)')
    

    Output:

        0
    0  AB
    1  AD
    
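    Aggregating the output of extractall, here grouped by match number (the first matches of all rows are joined together, then the second matches):
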
    (df['col']
     .str.extractall('(A.)')
     .groupby(level='match').agg(','.join)
    )
    

    Output:

               0
    match       
    0      AB,AD
    1         AE
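
For the original goal (all matched words per row, stored in a new column), grouping on the first index level instead of "match" gives one string per original row. The ValueError in the question most likely comes from assigning the MultiIndexed result of extractall directly to a column, since its index does not line up with df.index. The following is a minimal sketch under stated assumptions: the sample data and the contents of wordlist are hypothetical, and re.escape is an added precaution in case the words contain regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical sample data and word list (not taken from the question)
df = pd.DataFrame({'text': ['apple and banana', 'no fruit here', 'cherry, apple']})
wordlist = ['apple', 'banana', 'cherry']

# Build one capture group with all words; re.escape guards against regex metacharacters
pattern = f"({'|'.join(map(re.escape, wordlist))})"

# One row per match, indexed by (original row label, match number)
matches = df['text'].str.extractall(pattern)

# Group on the first index level (the original row label) to get one string per row
per_row = matches[0].groupby(level=0).agg(','.join)

# Rows without any match are absent from the extractall result;
# reindex restores them and fills them with an empty string
df['findWord'] = per_row.reindex(df.index, fill_value='')
print(df)
```

If a comma-separated string per row is all that is needed, df['text'].str.findall(pattern).str.join(',') should produce the same column without the intermediate MultiIndex.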