I am new to Python and also to Stack Overflow. I have a CSV file with three columns (ID, Date_Of_creation, Text). There are almost 25,000 entries in the file. I have to remove the duplicate tweets (Text column), and the code below works fine to remove duplicates:
import csv
csvInputFile = open('inputFile.csv', 'r', encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')
csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()
for row in csvReader:
    #print(row[3])
    if row[3] in cleanData: continue
    cleanData.add(row[3])
    csvWriter.writerow(row)
print(cleanData)
csvOutputFile.close()
csvInputFile.close()
This code removes all the duplicates along with their corresponding IDs and creation dates. As a second step of the analysis, I noticed that there are some retweets whose original tweets are not in the data set. I want to keep those retweets. Put simply, I want to remove all duplicates from the Text column, whether they are tweets or retweets. For example:
"It will not be easy for them to handle the situation at this stage:…"
"RT @ReutersLobby: It will not be easy for them to handle the situation at this stage:…"
As the tweet and retweet above show, the retweet only has an extra "RT @ReutersLobby:" prefix, so the code above will not remove this retweet from the final set. I want to remove all such tweets that are a copy of another tweet, because the focus is on the text of the tweet and its creation time and not on other fields. I searched the forum but could not find anything related. I hope someone can help me with this problem.
I think it's a pretty quick fix:
import csv
import re
csvInputFile = open('inputFile.csv', 'r', encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')
csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()
for row in csvReader:
    #print(row[3])
    # Skip the row if its text, or its text without a leading "RT @user: "
    # prefix, has already been written. \w+ keeps the match to the handle only.
    if row[3] in cleanData or re.sub(r'^RT @\w+: ', '', row[3]) in cleanData:
        continue
    cleanData.add(row[3])
    csvWriter.writerow(row)
print(cleanData)
csvOutputFile.close()
csvInputFile.close()
The condition I added checks whether the tweet, once stripped of its retweet prefix, already exists in the cleaned set.
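One caveat: that check only catches a retweet when the original tweet appears earlier in the file. If the retweet comes first, the later original is still written, because the set stores the raw text including the prefix. A possible variation, sketched below, stores the prefix-stripped text as the key so the order of rows does not matter. The names `RT_PREFIX`, `normalize`, `dedupe_rows` and `dedupe_file` are made up for this example, the prefix is assumed to always look like "RT @handle: ", and the default column index 3 is simply copied from the question's code.

```python
import csv
import re

# Assumption: a retweet prefix looks like "RT @handle: " and may be repeated
# when a retweet is itself retweeted.
RT_PREFIX = re.compile(r'^(?:RT @\w+: )+')


def normalize(text):
    """Return the tweet text with any leading retweet prefix removed."""
    return RT_PREFIX.sub('', text)


def dedupe_rows(rows, text_index=3):
    """Yield the first row seen for each prefix-stripped tweet text."""
    seen = set()
    for row in rows:
        key = normalize(row[text_index])
        if key in seen:
            continue  # same text already kept, as a tweet or a retweet
        seen.add(key)
        yield row


def dedupe_file(in_path, out_path, text_index=3):
    """Copy in_path to out_path, dropping rows whose tweet text repeats."""
    with open(in_path, 'r', encoding='utf-8', newline='') as src, \
         open(out_path, 'w', encoding='utf-8', newline='') as dst:
        csv.writer(dst).writerows(dedupe_rows(csv.reader(src), text_index))
```

Calling `dedupe_file('inputFile.csv', 'outputFile.csv')` would then replace the loop above. Note that whichever copy appears first in the file is the one kept, so a retweet survives if its original only shows up later or not at all, which matches the requirement to keep retweets whose originals are missing.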