python, regex, csv, twitter

How to deal with quote characters within quote characters in CSV files?


I am building a Twitter scraper in Python that scrapes my home timeline and creates a readable CSV file with the tweet ID, tweet creator, timestamp, and tweet content. Tweets often contain commas (the delimiter I am using), which is not an issue when the tweet content column is wrapped in single quotes (the quotechar I am using). However, due to the limitations of the Twitter API, some tweets contain both single quotes and commas, which confuses the CSV reader into treating commas within tweets as delimiters.

I have attempted to use regular expressions to remove or replace the single quotes that sit inside the outer quote characters I would like to keep, but I have not found a way to do so.

Here is what tweets.txt looks like:

ID,Creator,Timestamp,Tweet
1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,'At Adobe's summit, 'experience' was everywhere'

Here is my Python script:

import csv

with open('tweets.txt', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter = ',', quotechar="'")
    for line in csv_reader:
        print(line)

I would like to receive an output like this:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobe^s summit, ^experience^ was everywhere']

But currently, because the tweet content itself contains single quotes, the CSV reader treats the commas inside it as delimiters and gives this output:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobes summit', " 'experience' was everywhere'"]

Solution

  • If you know the number of columns in your CSV, and only the last column is free text that may contain commas, you could use Python's string methods:

    with open('tweets.txt', 'r') as file:
        for line in file:
            fields = (line.strip()                       # Get rid of newlines
                          .split(",", 3))                # Split into at most four columns
            fields[-1] = (fields[-1].strip("'")          # Remove flanking single quotes
                                    .replace("'", "^"))  # Replace inner single quotes if required
            print(fields)
    

    This code has many limitations and will fit only your current case.
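If you also control how the file is written, one possible follow-up to the string-splitting approach above is to re-emit each row through `csv.writer`, which doubles embedded quote characters so that `csv.reader` can parse the result without losing the original quotes. This is a minimal sketch, not a definitive implementation: the helper names `split_tweet_line` and `rewrite_as_valid_csv` are hypothetical, and it still assumes four columns where only the last one may contain commas.

```python
import csv
import io


def split_tweet_line(line, n_columns=4, quotechar="'"):
    """Split one raw line into n_columns fields (hypothetical helper).

    Assumes only the last column may contain the delimiter.
    """
    fields = line.rstrip("\r\n").split(",", n_columns - 1)
    last = fields[-1]
    # Remove exactly one flanking quote pair, so a tweet that itself
    # starts or ends with a quote keeps that character.
    if len(last) >= 2 and last[0] == quotechar and last[-1] == quotechar:
        last = last[1:-1]
    fields[-1] = last
    return fields


def rewrite_as_valid_csv(raw_text):
    """Re-emit raw lines via csv.writer so csv.reader can parse them."""
    out = io.StringIO()
    # With the default doublequote=True, embedded quotes are written as ''.
    writer = csv.writer(out, quotechar="'", lineterminator="\n")
    for line in raw_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(split_tweet_line(line))
    return out.getvalue()


# Sample data taken from the question.
raw = (
    "ID,Creator,Timestamp,Tweet\n"
    "1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,"
    "'At Adobe's summit, 'experience' was everywhere'\n"
)

fixed = rewrite_as_valid_csv(raw)
rows = list(csv.reader(io.StringIO(fixed), quotechar="'"))
print(rows[1][3])  # At Adobe's summit, 'experience' was everywhere
```

If the scraper itself writes its rows with `csv.writer` in the first place, the repair step should not be needed, because the writer escapes the quotes as it goes.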