python pandas string replace sentence

Replace word starting with string anywhere in sentence in each row python


I am trying to replace words that start with a given string in text data stored across multiple rows of a DataFrame. The df has 6 columns: date, user, tweet, language, coordinates, place. The replacement takes place in the tweet column, which contains, for example, in row 1:

« Nous ne sommes pas très favorables au télétravail mais nous avons de super locaux tout neufs » Il manque un babyfoot et les courtiers auront rejoint les rangs de la Start-up Nation https://link.

In row 2 : « Une étude de la Spire Healthcare a révélé, que le #télétravail pouvait avoir un impact sur le #cyclemenstruel, et ce, notamment à cause de la fatigue, du #stress et du manque d’activité physique induits par le travail à distance »

Via @marieclaire_fr https://link.

In row 3 : @IrisDessine @fredchristian__ Mais c'est super pratique car j'ai horreur de passer la serpillière 😁 le mien aspire et lave et franchement c'est devenu mon meilleur ami 😅 il bosse tranquillou quand je suis en télétravail 👍

Etc.

I would like to replace words starting with '@' with '@user', and replace links (words starting with 'http') with 'http'. All columns of the df have dtype object. I have tried several things:

for individual_word in df["Tweet"]:
    #print(individual_word)
    if individual_word.startswith('@') and len(individual_word) > 1:
        individual_word = '@user'

With this code nothing happens: no error, no replacement. Another attempt:

for individual_word in df["Tweet"].split(' '):
    #print(individual_word)
    if individual_word.split(' ').startswith('@') and len(individual_word) > 1:
        individual_word = '@user'

With this code I get the error: 'Series' object has no attribute 'split'. Another attempt:

for individual_word in df["Tweet"].str.split(' '):
    #print(individual_word)
    if individual_word.str.split(' ').startswith('@') and len(individual_word) > 1:
        individual_word = '@user'
        #print(individual_word)

With this code I get the error: 'list' object has no attribute 'str'. I have tried the same thing after converting the Tweet column to string, but nothing changes. Depending on the code I try, I think either each row is treated as a list, so I have to look through a list of lists for words starting with '@' and 'http' and replace them, or each row is treated as a single word rather than a sentence. In the latter case, a row whose first word starts with '@' would be changed, but a word starting with '@' later in the sentence would not.

I have also tried with a list of lists. I can put my data in a list called my_list, with 3 columns: Type, Size, Value. Row 1 of my_list in column Value:

['être', 'favorable', 'télétravail', 'super', 'local', 'neuf', 'manque', 'babyfoot', 'courtier', 'rejoindre', 'rang', 'start', 'nation', 'https://link']

Row 2 of my_list in column Value :

['étude', 'spire', 'healthcare', 'révéler', 'télétravail', 'impact', 'cyclemenstruel', 'cause', 'fatigue', 'stress', 'manque', 'activité', 'physique', 'induire', 'travail', 'distance', '@marieclaire_fr', 'https://link']

Row 3 of my_list in column Value :

['vive', 'télétravail', 'commeunlundi', 'https://link']

I have tried the code :

for each_list in my_list:
    #print(each_list)
    for each_word in each_list:
        #print(each_word)
        if each_word.startswith('@') and len(each_word) > 1:
            #print(each_word)
            each_word = '@user'

I don't get any errors, but the words aren't changed in the lists of my_list.

Thank you for your help!


Solution

  • You can do this with the pandas string methods, which apply the replacement to every row of the column at once. Also have a look at regex101 to check which regex works best for your case.

    df['tweets'] = df['tweets'].str.replace(r'@\S+', '@user', regex=True)
    >>>df['tweets']
        tweets
    0   « Une étude de la Spire Healthcare a révélé, q...
    1   @user @user Mais c'est super pratique car j'ai...
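
The same approach can be extended to the links, and a similar idea works for the list of lists. What follows is only a minimal sketch under a few assumptions: the sample rows are shortened from the question, the column is named "Tweet" as in the question's code, and mask_token is a hypothetical helper name, not part of pandas. The loops in the question change nothing because assigning to the loop variable only rebinds that name; it never writes back into the Series or the list, so the sketch builds new values instead.

```python
import pandas as pd

# Shortened sample rows from the question (illustrative data only).
df = pd.DataFrame({
    "Tweet": [
        "Via @marieclaire_fr https://link.",
        "@IrisDessine @fredchristian__ Mais c'est super pratique",
    ]
})

# regex=True is passed explicitly: since pandas 2.0, str.replace treats
# the pattern as a literal string unless regex=True is given.
df["Tweet"] = (
    df["Tweet"]
    .str.replace(r"@\S+", "@user", regex=True)    # mentions -> '@user'
    .str.replace(r"http\S+", "http", regex=True)  # links -> 'http'
)

# Tokenized variant (list of lists), shortened from the question.
my_list = [
    ["étude", "télétravail", "@marieclaire_fr", "https://link"],
    ["vive", "télétravail", "commeunlundi", "https://link"],
]

def mask_token(word):
    """Hypothetical helper: return the placeholder for a mention or link."""
    if word.startswith("@") and len(word) > 1:
        return "@user"
    if word.startswith("http"):
        return "http"
    return word

# Build new lists rather than assigning to the loop variable.
cleaned = [[mask_token(word) for word in tokens] for tokens in my_list]

print(df["Tweet"].tolist())
print(cleaned)
```

Note that r"@\S+" also matches an '@' in the middle of a token (for example in an e-mail address) and swallows trailing punctuation attached to the mention. If only whitespace-delimited words starting with '@' should be replaced, a pattern such as r"(?<!\S)@\S+" may be a better fit.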