Removing partially similar entries based on a value from a single column in CSV


I have a CSV table listing tweets from different users over time. The dataset includes original tweets and reposts, which are identical to the originals except for hashtags or extra comments added by another user. For example:

Column A Column B
11/03/2022 We have a new president!
13/03/2022 We have a new president! #newpresident
14/03/2022 My mom is a president.
14/03/2022 RT @user: We have a new president! What is going to happen?

I consider every later row that contains "We have a new president!" a duplicate of the first one and need to remove those, so that only the original rows #1 and #3 remain. I tried running this:

import csv
import re

csvInput = open('input.csv', 'r', encoding="utf-8-sig", newline='')
csvOutput = open('output.csv', 'w', encoding="utf-8-sig", newline='')

csvReader = csv.reader(csvInput)
csvWriter = csv.writer(csvOutput)
prevRows = set()

for row in csvReader:
    if row[2] in prevRows or re.sub('^RT @.*: ', '', row[2]) in prevRows:
        continue
    prevRows.add(row[2])
    csvWriter.writerow(row)

csvOutput.close()
csvInput.close()

It doesn't do the trick. Is there a way to fix this, or is there a better solution?


Solution

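  • The pandas module could be quite useful. The snippet below drops every row whose text contains a hashtag or starts with "RT @":

    import pandas as pd

    # Load the file first so the frame can be referenced inside the filter
    df = pd.read_csv('input.csv')
    # Keep only rows without a hashtag and without a leading "RT @"
    df.loc[~df['Column B'].str.contains(r'#.+$|^RT @.+')].to_csv('output.csv', index=False)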
    

    PS. not tested
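
If the goal is to compare rows against each other rather than filter on hashtags and "RT @", a substring check with the standard csv module might be closer to what the question describes. The following is only a minimal sketch under a few assumptions: the tweet text is in the second column (index 1), the file has a header row, and an original tweet always appears before its reposts. The names `drop_reposts` and `RT_PREFIX` are made up for this illustration, and the sample data is the table from the question.

```python
import csv
import io
import re

# Hypothetical helper pattern: strips a leading "RT @user: " marker
RT_PREFIX = re.compile(r'^RT @\w+:\s*')

def drop_reposts(rows, text_col=1):
    """Keep a row only if no previously kept text occurs inside its text."""
    kept_texts = []
    kept_rows = []
    for row in rows:
        text = RT_PREFIX.sub('', row[text_col]).strip()
        # A repost contains the original text plus hashtags or comments
        if any(seen in text for seen in kept_texts):
            continue
        if text:  # an empty string is a substring of everything, so skip it
            kept_texts.append(text)
        kept_rows.append(row)
    return kept_rows

# Sample data from the question; a real run would open 'input.csv' instead
sample = io.StringIO(
    "Column A,Column B\n"
    "11/03/2022,We have a new president!\n"
    "13/03/2022,We have a new president! #newpresident\n"
    "14/03/2022,My mom is a president.\n"
    "14/03/2022,RT @user: We have a new president! What is going to happen?\n"
)
reader = csv.reader(sample)
header = next(reader)  # assumes the first line is a header
result = drop_reposts(list(reader))
for row in result:
    print(row)
```

On the sample this keeps the rows dated 11/03/2022 and 14/03/2022 ("My mom is a president."). Each row is compared against every kept text, so the cost grows quadratically with the number of rows, and a short tweet that happens to appear inside an unrelated longer one would cause the longer one to be dropped as well.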