python, dataframe, nlp, string-length

Delete words from strings in a pandas DataFrame column according to word length


I have a pandas DataFrame with 13 columns and 60000 rows. One of these columns, named "Text" (type object), contains quite long text cells:

    Text    ID  AI  BI  GH  JB  EQ  HE  EN  MA  WE  WR
2585    obstetric gynaecologicaladmissions owing abor...    2585    0   0   0   0   0   1   0   0   0   0
507     graphic illustration process flow help organiz...   507     0   0   0   0   0   0   0   0   1   0

Some words in some rows are stuck together (as in the first row of the DataFrame: gynaecologicaladmissions). To get rid of these, I would like to delete all such cases from my entire dataset. My idea is to delete, for each row in the "Text" column, every word that has more than 13 characters.

I tried this line of code:

res.loc[res['Text'].str.len() < 13]

But the result contains only two empty rows.

How can I solve this problem?


Solution

  • Let's take a sample dataframe

    df
    
        text
    0   obstetric gynaecologicaladmissions owing
    1   graphic illustration process flow help
    2   process flow help
    3   illustrationprocess flow
    

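    str.len() measures the length of the whole cell, so the filter in the question only keeps rows whose entire text is shorter than 13 characters. Because you need to check the length of each word instead, split each string on the separator (in this case whitespace), loop through the resulting list and keep only the words whose length is <= 13. To run this on every row, you can use apply: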

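    def func(x):
        # x is the list of words produced by str.split()
        res = list()
        for word in x:
            # keep only words of at most 13 characters
            if len(word) <= 13:
                res.append(word)
        return " ".join(res)

    # split each cell on whitespace, then filter and rejoin the words
    df['text'] = df['text'].str.split().apply(func)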
    df
        
         text
    0   obstetric owing
    1   graphic illustration process flow help
    2   process flow help
    3   flow
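
For reference, here is a self-contained sketch of the same idea that can be run as-is. It rebuilds the sample dataframe, shows why the filter from the question does not help, and adds a regex-based variant. The names MAX_LEN, drop_long_words, clean and clean_regex are made up for this illustration and are not part of the original answer. It also assumes the column has no missing values, since a NaN cell would make text.split() fail.

```python
import pandas as pd

MAX_LEN = 13  # threshold taken from the question (assumed to be inclusive)

df = pd.DataFrame({
    "text": [
        "obstetric gynaecologicaladmissions owing",
        "graphic illustration process flow help",
        "process flow help",
        "illustrationprocess flow",
    ]
})

# Series.str.len() counts the characters of the whole cell, not of each word,
# so this filter only keeps rows whose entire text is shorter than MAX_LEN.
whole_cell_filter = df.loc[df["text"].str.len() < MAX_LEN]

def drop_long_words(text, max_len=MAX_LEN):
    """Keep only whitespace-separated words of at most max_len characters."""
    return " ".join(word for word in text.split() if len(word) <= max_len)

# Per-row version: same logic as the loop above, written as a generator.
df["clean"] = df["text"].apply(drop_long_words)

# Regex version: remove every run of more than MAX_LEN non-space characters,
# then split and rejoin to collapse the leftover whitespace.
df["clean_regex"] = (
    df["text"]
    .str.replace(rf"\S{{{MAX_LEN + 1},}}", "", regex=True)
    .str.split()
    .str.join(" ")
)

print(df[["clean", "clean_regex"]])
```

Both columns should hold the same cleaned text on this sample. On a 60000-row column the regex version avoids a Python-level function call per row, though whether that matters in practice would need to be measured on the real data.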