I have a CSV table that lists tweets from different users over time. The dataset includes both tweets and reposts, which are identical except for hashtags or additional comments added by another user. For example:
Column A | Column B |
---|---|
11/03/2022 | We have a new president! |
13/03/2022 | We have a new president! #newpresident |
14/03/2022 | My mom is a president. |
14/03/2022 | RT @user: We have a new president! What is going to happen? |
All the rows that contain "We have a new president!" count as duplicates for me, and I need to get rid of every copy after the first, so the original rows #1 and #3 are the only ones I need. I tried running this:
import csv
import re

csvInput = open('input.csv', 'r', encoding="utf-8-sig", newline='')
csvOutput = open('output.csv', 'w', encoding="utf-8-sig", newline='')
csvReader = csv.reader(csvInput)
csvWriter = csv.writer(csvOutput)
prevRows = set()
for row in csvReader:
    if row[2] in prevRows or re.sub('^RT @.*: ', '', row[2]) in prevRows:
        continue
    prevRows.add(row[2])
    csvWriter.writerow(row)
csvOutput.close()
csvInput.close()
This doesn't do the trick. Is there a way to modify it, or is there a better solution?
The pandas module could be quite useful here:
import pandas as pd
df = pd.read_csv('input.csv')
# Drop every row whose text contains a hashtag or starts with "RT @"
df.loc[~df['Column B'].str.contains(r'(?:#.+$|^RT @.+)', na=False)].to_csv('output.csv', index=False)
PS. not tested
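Note that a hashtag/RT filter removes any tweet with a hashtag, even if it is the only copy. If you want "drop a row when an earlier tweet is contained in it" instead, here is a minimal sketch using only the standard library. It assumes the rows are already sorted by date (so the original comes first), that the tweet text is in the second column, and that retweets start with `RT @user: `. The names `normalize` and `dedupe_rows` are made up for this example.

```python
import re

# Assumed retweet prefix and hashtag shapes; adjust to the real data.
RT_PREFIX = re.compile(r'^RT @\w+:\s*')
HASHTAG = re.compile(r'#\w+')


def normalize(text):
    """Strip a leading 'RT @user: ' and any hashtags, then collapse whitespace."""
    text = RT_PREFIX.sub('', text)
    text = HASHTAG.sub('', text)
    return ' '.join(text.split())


def dedupe_rows(rows, text_col=1):
    """Keep a row only if no earlier kept tweet is contained in its normalized text."""
    kept_texts = []
    kept_rows = []
    for row in rows:
        text = normalize(row[text_col])
        # Skip empty texts in the comparison: '' is a substring of everything.
        if any(prev and prev in text for prev in kept_texts):
            continue
        kept_texts.append(text)
        kept_rows.append(row)
    return kept_rows


rows = [
    ['11/03/2022', 'We have a new president!'],
    ['13/03/2022', 'We have a new president! #newpresident'],
    ['14/03/2022', 'My mom is a president.'],
    ['14/03/2022', 'RT @user: We have a new president! What is going to happen?'],
]
result = dedupe_rows(rows)
for row in result:
    print(row)
```

With the sample table from the question this keeps rows #1 and #3. To run it on the file, pass `csv.reader(...)` rows in and write the result with `csv.writer(...).writerows(...)`. The comparison is quadratic in the number of kept tweets, so it may be slow on a very large dataset.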