Tags: python, python-3.x, csv, twitter, nlp

Removing tweets with partial similarity


I am new to Python and also to Stack Overflow. I have a CSV file with three columns (ID, Date_Of_creation, Text) and almost 25,000 entries. I have to remove the duplicate tweets (Text column), and the code below removes exact duplicates fine:

import csv

csvInputFile = open('inputFile.csv', 'r',encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()

for row in csvReader:
    #print(row[3])
    if row[3] in cleanData: continue
    cleanData.add(row[3])
    csvWriter.writerow(row)

print(cleanData)
csvOutputFile.close()
csvInputFile.close()

This code removes all exact duplicates along with their corresponding IDs and creation dates. As a second step of the analysis, I noticed that there are some retweets whose original tweets are not in the data set. I want to keep those retweets. Put simply, I want to remove all duplicates from the Text column, whether the duplicate is a tweet or a retweet. For example:

"It will not be easy for them to handle the situation at this stage:…"

"RT @ReutersLobby: It will not be easy for them to handle the situation at this stage:…"

As the tweet and retweet above show, the only difference is the extra "RT @ReutersLobby:" prefix on the retweet, so the code above will not remove this retweet from the final set. I want to remove all tweets that are a copy of another tweet, because the focus is on the tweet text and creation time, not on the other fields. I searched the forum but could not find anything related. I hope someone can help me out with this problem.


Solution

  • I think it's a pretty quick fix:

    import csv
    import re
    
    csvInputFile = open('inputFile.csv', 'r', encoding="utf-8", newline='')
    csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')
    
    csvReader = csv.reader(csvInputFile)
    csvWriter = csv.writer(csvOutputFile)
    cleanData = set()
    
    for row in csvReader:
        # Strip a leading "RT @user: " so a retweet and its original compare equal.
        # \w+ stops at the end of the handle; '.*' would greedily eat up to the last ': '.
        text = re.sub(r'^RT @\w+: ', '', row[3])
        if text in cleanData:
            continue
        # Store the stripped text so the check works whichever copy comes first.
        cleanData.add(text)
        csvWriter.writerow(row)
    
    print(cleanData)
    csvOutputFile.close()
    csvInputFile.close()
    

    The added condition checks whether the tweet, once stripped of its retweet prefix, already exists in the cleaned set.
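To try the prefix stripping without touching the CSV files, the same idea can be sketched as a small function over in-memory rows. This is only a sketch under some assumptions: the names `strip_rt_prefix`, `dedupe_rows` and `text_col` are made up for illustration, the retweet prefix is assumed to always look like "RT @handle:" at the very start of the text, and the IDs and dates in the sample rows are invented. It keys the set on the stripped text, so the first copy of a tweet is kept whether it is the original or a retweet.

```python
import re

# Assumed prefix format: "RT @handle:" at the very start of the text.
# \w+ matches only the handle, so later colons in the tweet are left alone.
RT_PREFIX = re.compile(r'^RT @\w+:\s*')


def strip_rt_prefix(text):
    """Return the tweet text without a leading 'RT @handle:' prefix."""
    return RT_PREFIX.sub('', text)


def dedupe_rows(rows, text_col):
    """Keep the first row for each distinct prefix-stripped text."""
    seen = set()
    kept = []
    for row in rows:
        key = strip_rt_prefix(row[text_col])
        if key in seen:
            continue  # same text already kept, as a tweet or as a retweet
        seen.add(key)
        kept.append(row)
    return kept


# Hypothetical sample rows (ID, Date_Of_creation, Text); IDs and dates are invented.
rows = [
    ['1', '2020-01-01', 'RT @ReutersLobby: It will not be easy for them to handle the situation at this stage:…'],
    ['2', '2020-01-02', 'It will not be easy for them to handle the situation at this stage:…'],
    ['3', '2020-01-03', 'RT @someone: a retweet whose original is missing'],
]

result = dedupe_rows(rows, text_col=2)
print([row[0] for row in result])  # ['1', '3']
```

Row 3 survives because nothing else in the data shares its text, which matches the requirement of keeping retweets whose originals are missing. Note that this only catches copies that are identical after the prefix is removed; retweets that were truncated differently from the original would still slip through.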