python pandas nlp

Looking for an efficient way to split a text column in pandas


I have a pandas DataFrame and want to split the text column so that each row holds just two words. When splitting, I need to preserve the order so that I can later recombine the pieces by line number. Is there an efficient way to do this? I can do it with a list comprehension, but I'm looking for something more efficient. Thanks.

import pandas as pd
df = pd.DataFrame({'col1':[22,23,44], 'col2': ['rr','gg','xx'], 'text': ['this is a sample text', 'this is another one','third example is a longer text']})


Solution

  • Using str.findall, explode, and groupby.cumcount:

    out = (df.assign(text=df['text'].str.findall(r'(\S+(?:\s+\S+)?)'))
             .explode('text')
             .assign(line=lambda d: d.groupby(level=0).cumcount())
           )
    

    Regex variant to handle any number N of words per line:

    N=2
    out = (df.assign(text=df['text'].str.findall(fr'((?:\S+\s+){{,{N-1}}}(?:\S+))\s*'))
             .explode('text')
             .assign(line=lambda d: d.groupby(level=0).cumcount())
           )
    

    Alternative with the batched recipe from the itertools documentation (available directly as itertools.batched in Python 3.12+):

    from itertools import islice
    
    def batched(iterable, n):
        "Batch data into tuples of length n. The last batch may be shorter."
        # batched('ABCDEFG', 3) --> ABC DEF G
        if n < 1:
            raise ValueError('n must be at least one')
        it = iter(iterable)
        while batch := tuple(islice(it, n)):
            yield batch
    
    N = 2
    out = (df.assign(text=df['text'].map(lambda x: list(map(' '.join, batched(x.split(), N)))))
             .explode('text')
             .assign(line=lambda d: d.groupby(level=0).cumcount())
           )
    

    Output (N=2):

       col1 col2           text  line
    0    22   rr        this is     0
    0    22   rr       a sample     1
    0    22   rr           text     2
    1    23   gg        this is     0
    1    23   gg    another one     1
    2    44   xx  third example     0
    2    44   xx           is a     1
    2    44   xx    longer text     2
    

    Example output with N=4:

       col1 col2                 text  line
    0    22   rr     this is a sample     0
    0    22   rr                 text     1
    1    23   gg  this is another one     0
    2    44   xx   third example is a     0
    2    44   xx          longer text     1
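
The question also asks about putting the pieces back together by line. Here is a minimal sketch of that round trip, reusing the first approach above. It assumes the words in the original text are separated by single spaces (otherwise the rejoined string differs in whitespace), and the name `restored` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [22, 23, 44],
    'col2': ['rr', 'gg', 'xx'],
    'text': ['this is a sample text', 'this is another one',
             'third example is a longer text'],
})

# Split into two-word chunks, one chunk per row, numbered within each original row.
out = (df.assign(text=df['text'].str.findall(r'(\S+(?:\s+\S+)?)'))
         .explode('text')
         .assign(line=lambda d: d.groupby(level=0).cumcount())
       )

# Recombine: expose the original row label as a column named 'index',
# order the chunks by (row, line), then join the chunks of each row.
restored = (out.reset_index()
               .sort_values(['index', 'line'])
               .groupby('index')['text']
               .agg(' '.join)
            )

print(restored.tolist() == df['text'].tolist())
```

Sorting on `line` before joining is what makes the result independent of the current row order, so it still works if `out` was shuffled or filtered in between.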