I have a pandas DataFrame with 13 columns and 60,000 rows. One of these columns, named "Text" (dtype object), contains fairly long text cells:
Text ID AI BI GH JB EQ HE EN MA WE WR
2585 obstetric gynaecologicaladmissions owing abor... 2585 0 0 0 0 0 1 0 0 0 0
507 graphic illustration process flow help organiz... 507 0 0 0 0 0 0 0 0 1 0
Some words in some rows are stuck together (for example gynaecologicaladmissions in the first row). To get rid of them, I would like to remove all such cases from my entire dataset. My idea is to delete, in each row of the "Text" column, every word that has more than 13 characters.
I tried this line of code:
res.loc[res['Text'].str.len() < 13]
But it only returns two empty rows.
How can I solve this problem?
Let's take a sample dataframe
df
text
0 obstetric gynaecologicaladmissions owing
1 graphic illustration process flow help
2 process flow help
3 illustrationprocess flow
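Your attempt filters on res['Text'].str.len(), which is the length of the whole cell, so it only keeps rows whose entire text is shorter than 13 characters. Since you need to check the length of each word, split each string on its separator (here, whitespace), loop through the resulting list, and keep only the words whose length is <= 13. To run that loop on every row, you can use apply: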
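def func(x):
    # x is the list of words produced by str.split() for one row
    res = list()
    for word in x:
        # keep only words of at most 13 characters
        if len(word) <= 13:
            res.append(word)
    return " ".join(res)

# split each cell on whitespace, filter the words, then join them back
df['text'] = df['text'].str.split().apply(func)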
df
text
0 obstetric owing
1 graphic illustration process flow help
2 process flow help
3 flow
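As a possible alternative to the apply loop, the same filtering can be sketched with a regular expression, which may be more convenient on 60,000 rows. This is only a sketch under a few assumptions: MAX_LEN, pattern and cleaned are names introduced here for illustration, and a "word" is taken to be any run of non-whitespace characters, which is the same definition str.split() uses above.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "text": [
        "obstetric gynaecologicaladmissions owing",
        "graphic illustration process flow help",
        "process flow help",
        "illustrationprocess flow",
    ]
})

MAX_LEN = 13  # hypothetical name: longest word length to keep

# \S{14,} matches any whitespace-delimited token of 14 or more characters
pattern = r"\S{%d,}" % (MAX_LEN + 1)

cleaned = (
    df["text"]
    .str.replace(pattern, "", regex=True)   # drop the over-long tokens
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps they leave behind
    .str.strip()                            # trim leading/trailing spaces
)

print(cleaned.tolist())
```

On the sample data this gives the same four strings as the apply version. Note that, like the loop, it removes the stuck-together words entirely rather than splitting them back into their parts.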