python string dataframe countvectorizer

How to remove repeated letters in a dataframe?


I have the following string:

"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"

I have collected many tweets like that and stored them in a dataframe. How can I clean those rows by removing "hhhhhhhhhhhhhhhhhh" and keeping only the rest of the string in each row?

I'm also using CountVectorizer later, and its vocabulary ends up with a lot of entries like 'hhhhhhhhhhhhhhhhhhhhhhh'.


Solution

  • You may try this:

    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)
    

    The number in h{4,} is the minimum run length to match (4 in my case), so you can adjust it as needed. Shorter runs such as "hh" are left untouched.

    Before:

                                            Col
    0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
    1                               Hello World

    After:

                         Col
    0  hello, I'm today hh  
    1            Hello World
    

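    I used a unicode string literal (the u prefix), since you mentioned you are working with tweets. In Python 3 the prefix is optional, because every str is already unicode.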
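
If the junk is not always the letter "h", the same idea can probably be generalized with a backreference. The following is only a sketch under some assumptions: the column name "Col" and the threshold of 4 come from the answer above, while the sample rows, the pattern, and the whitespace cleanup are illustrative additions. It removes only standalone tokens made of one repeated character, so a word like "goooooood" is kept as it is.

```python
import pandas as pd

# Sample data: the first two rows mirror the answer above, the third is made up.
df = pd.DataFrame({"Col": [
    "hello, I'm today hh hhhh hhhhhhhhhhhhhhh",
    "Hello World",
    "zzzzzz so goooooood",
]})

# \b(\w)\1{3,}\b matches a whole token made of one word character
# repeated 4 or more times (the character itself plus 3+ copies).
pattern = r"\b(\w)\1{3,}\b"

cleaned = (
    df["Col"]
    .str.replace(pattern, "", regex=True)   # drop the repeated-character tokens
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps they leave behind
    .str.strip()                            # trim leading/trailing spaces
)

print(cleaned.tolist())
```

Removing these tokens before vectorizing should keep entries like 'hhhhhhhh' out of the CountVectorizer vocabulary. If you would rather shorten runs inside words as well, dropping the \b anchors and replacing with r"\1" collapses each run to a single character instead.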