python, nlp, data-preprocessing

Limit the number of repetitive consecutive characters in a string


I'm preprocessing tweets and need to cap the number of consecutive occurrences of "@USER" at 3. For example, a tweet like this:

this tweet contains hate speech @USER@USER@USER@USER@USER about a target group @USER@USER

after processing, should look like this:

this tweet contains hate speech @USER@USER@USER about a target group @USER@USER

I was able to achieve the desired result with a while loop, but I'm wondering whether there is a simpler way to do it. Thanks!

tweets = ["this tweet contains hate speech @USER@USER@USER@USER@USER about a target group @USER@USER"]

K = "@USER"
limit = 3
for tweet in tweets:
    tokens = tweet.split(' ')
    i = 0  # reset the index for every tweet

    while i < len(tokens):
        # Cap any token that contains K more than `limit` times
        if tokens[i].count(K) > limit:
            tokens[i] = K * limit
        i += 1

    # Join once, after every token has been checked
    tweet = " ".join(tokens)
    print(tweet)
# Output: this tweet contains hate speech @USER@USER@USER about a target group @USER@USER

Solution

  • You can use Python's re module to replace any run of 4 or more consecutive occurrences of @USER with exactly three:

    import re

    tweet = "this tweet contains hate speech @USER@USER@USER@USER@USER about a target group @USER@USER"
    tweet = re.sub(r'(@USER){4,}', r'@USER@USER@USER', tweet)  # re.sub returns a new string
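
If the token or the limit may change, the same idea can be wrapped in a small helper. This is only a sketch: the function name limit_repeats and its parameters are made up for illustration and are not part of the original answer. It builds the pattern from the token and the limit, and uses re.escape so that tokens containing regex metacharacters are matched literally.

```python
import re

def limit_repeats(text, token="@USER", limit=3):
    """Cap every run of consecutive `token` at `limit` occurrences (hypothetical helper)."""
    # Match limit + 1 or more back-to-back copies of the token.
    # re.escape keeps characters such as "." or "+" in the token literal.
    pattern = r"(?:%s){%d,}" % (re.escape(token), limit + 1)
    # A function replacement avoids backslash handling in the replacement string.
    return re.sub(pattern, lambda match: token * limit, text)

tweets = ["this tweet contains hate speech @USER@USER@USER@USER@USER about a target group @USER@USER"]
cleaned = [limit_repeats(tweet) for tweet in tweets]
print(cleaned[0])
```

Unlike the while loop in the question, which counts occurrences inside each space-separated word, this only shortens runs in which the token repeats back to back with nothing in between.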