I have pandas dataframe and want to split the text
column in such a way that each row has just two words. when splitting, I need to maintain the order so that I can combine them together based on line
. Is there efficient way to do this. I can do list comprehension but was looking at more efficient way. Thanks
df = pd.DataFrame({'col1':[22,23,44], 'col2': ['rr','gg','xx'], 'text': ['this is a sample text', 'this is another one','third example is a longer text']})
Using str.findall
, explode
, and groupby.cumcount
:
out = (df.assign(text=df['text'].str.findall(r'(\S+(?:\s+\S+)?)'))
.explode('text')
.assign(line=lambda d: d.groupby(level=0).cumcount())
)
Regexes variant to handle any number of words:
N=2
out = (df.assign(text=df['text'].str.findall(fr'((?:\S+\s+){{,{N-1}}}(?:\S+))\s*'))
.explode('text')
.assign(line=lambda d: d.groupby(level=0).cumcount())
)
Alternative with itertools
' batched
recipe:
from itertools import islice
def batched(iterable, n):
"Batch data into tuples of length n. The last batch may be shorter."
# batched('ABCDEFG', 3) --> ABC DEF G
if n < 1:
raise ValueError('n must be at least one')
it = iter(iterable)
while batch := tuple(islice(it, n)):
yield batch
N = 2
out = (df.assign(text=df['text'].map(lambda x: list(map(' '.join, batched(x.split(), N)))))
.explode('text')
.assign(line=lambda d: d.groupby(level=0).cumcount())
)
Output:
col1 col2 text line
0 22 rr this is 0
0 22 rr a sample 1
0 22 rr text 2
1 23 gg this is 0
1 23 gg another one 1
2 44 xx third example 0
2 44 xx is a 1
2 44 xx longer text 2
Example output with N=4
:
col1 col2 text line
0 22 rr this is a sample 0
0 22 rr text 1
1 23 gg this is another one 0
2 44 xx third example is a 0
2 44 xx longer text 1