I have the following string:
"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"
I have collected many tweets like that and assigned them to a dataframe. How can I clean those rows in dataframe by removing "hhhhhhhhhhhhhhhhhh" and only let the rest of the string in that row?
I'm also using countVectorizer later, so there was a lot of vocabularies that contained 'hhhhhhhhhhhhhhhhhhhhhhh'
You may try this:
df["Col"] = df["Col"].str.replace(u"h{4,}", "")
Where you may set the number of characters to match in my case 4.
Col
0 hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1 Hello World
Col
0 hello, I'm today hh
1 Hello World
I used unicode matching, since you mentioned you are in tweets.