Tags: python, json, csv, tweepy

Converting JSONL file to CSV - "JSONDecodeError: Extra data"


I am using tweepy's StreamListener to collect Twitter data, and the code I am using generates a JSONL file with a bunch of metadata. Now I would like to convert that file into a CSV, for which I found code that does just that. Unfortunately I have run into this error:

raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 7833)

I have read through other threads and I reckon it has something to do with json.loads not being able to process multiple JSON documents in a single string (which is of course the case for my JSONL file, one object per line). How can I circumvent this problem within the code? Or do I have to use a completely different approach to convert the file? (I am using Python 3.6, and the tweets I am streaming are mostly in Arabic.)
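(For reference, the error itself is easy to reproduce: json.loads expects exactly one JSON document per call, so a string containing one JSON object per line fails as soon as the second line is reached. The snippet below uses made-up data, not my actual tweets.)

import json

# Two JSON documents on separate lines, as in a JSONL file.
jsonl_text = '{"a": 1}\n{"a": 2}\n'

json.loads(jsonl_text)  # raises json.decoder.JSONDecodeError: Extra data: line 2 column 1

This is the code I am using for the conversion: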

__author__ = 'seandolinar'
import json
import csv
import io

'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''

data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file


fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')

for line in data_python:

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

csv_out.close()

Solution

  • If the data file consists of multiple lines, each of which is a single json object, you can use a generator to decode the lines one at a time.

    def extract_json(fileobj):
        # Using "with" ensures that fileobj is closed when we finish reading it.
        with fileobj:
            for line in fileobj:
                yield json.loads(line)
    

    The only changes to your code are that the data_json file is not read explicitly, and data_python is the result of calling extract_json rather than json.loads; additionally, the Python 2 unicode() calls are replaced with str(), since unicode does not exist in Python 3.6. Here's the amended code:

    import json
    import csv
    import io
    
    '''
    creates a .csv file using a Twitter .json file
    the fields have to be set manually
    '''
    
    def extract_json(fileobj):
        """
        Iterates over an open JSONL file and yields
        decoded lines.  Closes the file once it has been
        read completely.
        """
        with fileobj:
            for line in fileobj:
                yield json.loads(line)    
    
    
    data_json = io.open('stream_____.jsonl', mode='r', encoding='utf-8') # Opens the JSONL file
    data_python = extract_json(data_json)
    
    csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
    
    
    fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
    csv_out.write(fields)
    csv_out.write(u'\n')
    
    for line in data_python:
    
        #writes a row and gets the fields from the json object
        #screen_name and followers/friends are found on the second level hence two get methods
        row = [line.get('created_at'),
               '"' + line.get('text').replace('"','""') + '"', #creates double quotes
               line.get('user').get('screen_name'),
               str(line.get('user').get('followers_count')),  # str() instead of unicode(), which does not exist in Python 3
               str(line.get('user').get('friends_count')),
               str(line.get('retweet_count')),
               str(line.get('favorite_count'))]
    
        row_joined = u','.join(row)
        csv_out.write(row_joined)
        csv_out.write(u'\n')
    
    csv_out.close()
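
    As an aside, since the script already imports csv, the manual quoting of the text field could be delegated to csv.writer, which escapes embedded quotes and commas for you. A minimal sketch of that variant (not part of the original answer; it assumes the same file names and fields):

    import csv
    import io
    import json

    with io.open('stream_____.jsonl', mode='r', encoding='utf-8') as jsonl_in, \
         io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav'])
        for line in jsonl_in:
            tweet = json.loads(line)        # one JSON document per line
            user = tweet.get('user', {})
            writer.writerow([tweet.get('created_at'),
                             tweet.get('text'),
                             user.get('screen_name'),
                             user.get('followers_count'),
                             user.get('friends_count'),
                             tweet.get('retweet_count'),
                             tweet.get('favorite_count')])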