I am trying to replace words that start with a given string in text data stored across multiple rows of a dataframe. The df has 6 columns: date, user, tweet, language, coordinates, place. The replacement takes place in the tweet column, which contains, for example, in row 1:
« Nous ne sommes pas très favorables au télétravail mais nous avons de super locaux tout neufs » Il manque un babyfoot et les courtiers auront rejoint les rangs de la Start-up Nation https://link.
In row 2 : « Une étude de la Spire Healthcare a révélé, que le #télétravail pouvait avoir un impact sur le #cyclemenstruel, et ce, notamment à cause de la fatigue, du #stress et du manque d’activité physique induits par le travail à distance »
Via @marieclaire_fr https://link.
In row 3 : @IrisDessine @fredchristian__ Mais c'est super pratique car j'ai horreur de passer la serpillière 😁 le mien aspire et lave et franchement c'est devenu mon meilleur ami 😅 il bosse tranquillou quand je suis en télétravail 👍
Etc.
I would like to replace words starting with '@' with '@user' and replace the links (words starting with 'http') with 'http'. All the columns of the df have dtype object. I have tried multiple things:
for individual_word in df["Tweet"]:
    #print(individual_word)
    if individual_word.startswith('@') and len(individual_word) > 1:
        individual_word = '@user'
With this code nothing happens: no error, no replacement. Another attempt:
for individual_word in df["Tweet"].split(' '):
    #print(individual_word)
    if individual_word.split(' ').startswith('@') and len(individual_word) > 1:
        individual_word = '@user'
With this code I get the error: 'Series' object has no attribute 'split'. Another attempt:
for individual_word in df["Tweet"].str.split(' '):
    #print(individual_word)
    if individual_word.str.split(' ').startswith('@') and len(individual_word) > 1:
        individual_word = '@user'
        #print(individual_word)
With this code I get the error: 'list' object has no attribute 'str'. I have tried the same after converting the Tweet column to string, but nothing changes. Depending on the code I try, I think one of two things is happening. Either each row is treated as a list, so I have to search a list of lists for words starting with '@' and 'http' and replace them. Or each row is treated as a single word rather than a sentence, so a row whose first word starts with '@' would be changed, but a word starting with '@' later in the sentence would not.
I have also tried with a list of lists. I can have my data in a list called my_list, with 3 columns Type, Size, Value. Row 1 of my_list in column Value:
['être', 'favorable', 'télétravail', 'super', 'local', 'neuf', 'manque', 'babyfoot', 'courtier', 'rejoindre', 'rang', 'start', 'nation', 'https://link']
Row 2 of my_list in column Value :
['étude', 'spire', 'healthcare', 'révéler', 'télétravail', 'impact', 'cyclemenstruel', 'cause', 'fatigue', 'stress', 'manque', 'activité', 'physique', 'induire', 'travail', 'distance', '@marieclaire_fr', 'https://link']
Row 3 of my_list in column Value :
['vive', 'télétravail', 'commeunlundi', 'https://link']
I have tried this code:
for each_list in my_list:
    #print(each_list)
    for each_word in each_list:
        #print(each_word)
        if each_word.startswith('@') and len(each_word) > 1:
            #print(each_word)
            each_word = '@user'
I don't have any errors but the word isn't changed in each list of my_list.
Thank you for your help !
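You can do this with the pandas string methods. Also have a look at regex101 to check which regex works best for your case. Note that in recent pandas versions str.replace treats the pattern as a literal string unless you pass regex=True:
df['tweets'] = df['tweets'].str.replace(r'@\S+', '@user', regex=True)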
>>> df['tweets']
tweets
0 « Une étude de la Spire Healthcare a révélé, q...
1 @user @user Mais c'est super pratique car j'ai...
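As for why the loops in the question change nothing: assigning to the loop variable (individual_word = '@user') only rebinds that local name, it never writes back into the Series or the list. Below is a minimal sketch of both cases, the dataframe column and the list of lists. The sample data and the helper name anonymize are made up for illustration, and the link pattern r'http\S+' is my assumption about what counts as a link in your data, so check it against your real tweets.

```python
import pandas as pd

# Hypothetical sample data, shortened from the tweets in the question.
df = pd.DataFrame({
    "tweets": [
        "Via @marieclaire_fr https://link",
        "@IrisDessine @fredchristian__ Mais c'est super pratique",
        "vive le télétravail https://link",
    ]
})

# Case 1: a dataframe column of raw text.
# The result must be assigned back to the column; regex=True makes
# pandas treat the patterns as regular expressions.
df["tweets"] = (
    df["tweets"]
    .str.replace(r"@\S+", "@user", regex=True)    # mentions
    .str.replace(r"http\S+", "http", regex=True)  # links (assumed pattern)
)


# Case 2: a list of token lists.
def anonymize(token):
    """Hypothetical helper: map one token to its placeholder."""
    if token.startswith("@") and len(token) > 1:
        return "@user"
    if token.startswith("http"):
        return "http"
    return token


my_list = [
    ["étude", "spire", "@marieclaire_fr", "https://link"],
    ["vive", "télétravail", "commeunlundi", "https://link"],
]

# Build new lists instead of rebinding the loop variable.
cleaned = [[anonymize(token) for token in tokens] for tokens in my_list]

print(df["tweets"].tolist())
print(cleaned)
```

If your tokens live in a dataframe column of lists rather than a plain Python list, the same helper should work with something like df["Value"].apply(lambda tokens: [anonymize(t) for t in tokens]), assuming each cell really holds a list.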