How to remove repeating letters in a dataframe?

I have the following string:

"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"

I have collected many tweets like that and loaded them into a dataframe. How can I clean those rows by removing "hhhhhhhhhhhhhhhhhh" and keeping only the rest of the string in each row?

I'm also using CountVectorizer later, and the resulting vocabulary contains many entries like 'hhhhhhhhhhhhhhhhhhhhhhh'.

Solution

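You may try this:

df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)

The quantifier {4,} matches runs of four or more consecutive "h" characters; change the 4 to set the minimum run length to remove. Pass regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default.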

Before:

                                        Col
0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1                               Hello World

After:

                     Col
0  hello, I'm today hh  
1            Hello World
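
The two frames above can be reproduced with a small self-contained sketch. The column name "Col" and the sample rows are taken from the output shown; building the frame by hand is only for illustration, and regex=True assumes a pandas version that accepts that keyword.

```python
import pandas as pd

# Sample data matching the frame printed above
df = pd.DataFrame(
    {"Col": ["hello, I'm today hh hhhh hhhhhhhhhhhhhhh", "Hello World"]}
)
print(df)

# Remove every run of four or more consecutive "h" characters.
# regex=True makes pandas treat the pattern as a regular expression
# rather than as a literal string.
df["Col"] = df["Col"].str.replace(r"h{4,}", "", regex=True)
print(df)
```

Note that the spaces around the removed runs stay in place, which is why the first row ends with trailing whitespace.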

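I used a unicode string literal (the u prefix), since you mentioned you are working with tweets. On Python 3 the prefix is optional, because every str is already Unicode.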
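
If the noise is not always the letter "h", the same idea might be generalized with a backreference. The following is only a sketch under assumptions: the helper name strip_repeated_runs, its min_run parameter, and the second sample tweet are hypothetical and not part of the original answer. It removes any run of the same non-whitespace character and then tidies the leftover spaces.

```python
import pandas as pd

def strip_repeated_runs(series, min_run=4):
    """Hypothetical helper: drop runs of min_run or more identical characters."""
    # (\S) captures one non-whitespace character; \1{n,} requires at least
    # n more copies of that same character immediately after it.
    pattern = r"(\S)\1{%d,}" % (min_run - 1)
    cleaned = series.str.replace(pattern, "", regex=True)
    # Collapse the whitespace left behind and trim both ends.
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

tweets = pd.Series([
    "hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh",
    "zzzzzz so tired !!!!!!",  # made-up example row
])
cleaned = strip_repeated_runs(tweets)
print(cleaned.tolist())
```

Because the whole run is deleted, a word such as "goooood" would lose all of its "o" characters; raising min_run, or replacing with r"\1" to keep a single copy, may suit text where elongated words should survive.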