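I am working on a text problem where my pandas DataFrame has many columns, one of which contains paragraphs of text. What I need in the output are 3 columns, defined as follows -
word_length: the length of the longest word in the text
words: the longest word(s), i.e. every word of that length
word_count: the number of words of that length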
I count something as a word if it is separated by a space. I am looking for an answer using Python apply-map.
Here's some sample input data -
df = pd.DataFrame({'text':[
"that's not where the biggest opportunity is - it's with heart failure drug - very very huge market....",
"Of course! I just got diagnosed with congestive heart failure and type 2 diabetes. I smoked for 12 years and ate like crap for about the same time. I quit smoking and have been on a diet for a few weeks now. Let me assure you that I'd rather have a coke, gummi bears, and a bag of cheez doodles than a pack of cigs right now. Addiction is addiction.",
"STILLWATER, Okla. (AP) ? Medical examiner spokeswoman SpokesWoman: Oklahoma State player Tyrek Coger died of enlarged heart, manner of death ruled natural."
]})
df
                                                text
0  that's not where the biggest opportunity is - ...
1  Of course! I just got diagnosed with congestiv...
2  STILLWATER, Okla. (AP) ? Medical examiner spok...
Here is the expected output -
                                                text  word_count  word_length                    words
0  that's not where the biggest opportunity is - ...           1           11              opportunity
1  Of course! I just got diagnosed with congestiv...           1           10               congestive
2  STILLWATER, Okla. (AP) ? Medical examiner spok...           2           11  spokeswoman SpokesWoman
One possible solution using apply-map -
import nltk
import pandas as pd

# Assumes df is already loaded; nltk.word_tokenize needs the NLTK tokenizer data to be downloaded
# Tokenize each text and spread the tokens across columns (one token per cell, missing cells as padding)
expanded_text = df.text.apply(lambda x: ' '.join(nltk.word_tokenize(x))).str.split(" ", expand=True)
# Length of the longest token per row; df['word_length'] creates a real column, df.word_length does not
# (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
df['word_length'] = expanded_text.applymap(lambda x: len(x) if isinstance(x, str) else 0).max(axis=1)
for idx in expanded_text.index:
    # Keep only the real tokens (not padding) whose length equals this row's maximum
    longest = [w for w in expanded_text.loc[idx] if isinstance(w, str) and len(w) == df.loc[idx, 'word_length']]
    df.loc[idx, 'words'] = " ".join(longest)
    df.loc[idx, 'word_count'] = len(longest)
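A lighter variation on the same idea is to compute all three values per row in one small helper and pass it to apply, which avoids the wide token table. This is only a sketch: the helper name longest_words and the regex are my own assumptions, not part of the code above. The regex treats a word as a run of letters and apostrophes instead of using nltk, so trailing punctuation such as "," or ":" is not counted towards the length.

```python
import re

# Assumption: a "word" is a run of letters and apostrophes, so punctuation
# attached to a word (",", ":", "....") does not add to its length.
WORD_RE = re.compile(r"[A-Za-z']+")


def longest_words(text):
    """Return (word_count, word_length, words) for the longest word(s) in text."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        # No words at all (empty or punctuation-only text)
        return 0, 0, ""
    max_len = max(len(t) for t in tokens)
    # Keep every token that ties for the maximum length, in original order
    longest = [t for t in tokens if len(t) == max_len]
    return len(longest), max_len, " ".join(longest)


sample = ("STILLWATER, Okla. (AP) ? Medical examiner spokeswoman SpokesWoman: "
          "Oklahoma State player Tyrek Coger died of enlarged heart, "
          "manner of death ruled natural.")
print(longest_words(sample))  # (2, 11, 'spokeswoman SpokesWoman')
```

With the sample df, the three columns can then be filled in one line: df['word_count'], df['word_length'], df['words'] = zip(*df['text'].apply(longest_words)). If you split strictly on spaces instead, "SpokesWoman:" would be 12 characters long and the third row would no longer match the expected output.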