json pandas twitter attributeerror

Skipping AttributeError while importing Twitter data into pandas


I have a file of almost 1 GB storing about 0.2 million tweets. A file this large inevitably contains some bad records. They show up as AttributeError: 'int' object has no attribute 'items' when I run this code:

    import json
    from pandas.io.json import json_normalize  # pandas >= 1.0: pandas.json_normalize

    raw_data_path = input("Enter the path for raw data file: ")
    tweet_data_path = raw_data_path

    tweet_data = []
    with open(tweet_data_path, "r", encoding="utf-8") as tweets_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line)
                tweet_data.append(tweet)
            except ValueError:  # skip lines that are not valid JSON
                continue

    # keep only the records that parsed to a dictionary
    tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

    tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
                                          "created_at", "coordinates",
                                          "user.location", "id"]]

Is there a way to skip the lines that cause this error and continue with the rest of the file?


Solution

  • The issue is not with the lines in the file but with the contents of tweet_data. If you inspect tweet_data, you will find one or more elements of type int. json_normalize expects a dict or a list of dicts, so every element of the list must be a dictionary.

    You may want to check your tweet data and remove any values other than dictionaries.

    I was able to reproduce the error with the example below, adapted from the json_normalize documentation:

    Working Example:

    from pandas.io.json import json_normalize
    data = [{'state': 'Florida',
             'shortname': 'FL',
             'info': {
                  'governor': 'Rick Scott'
             },
             'counties': [{'name': 'Dade', 'population': 12345},
                         {'name': 'Broward', 'population': 40000},
                         {'name': 'Palm Beach', 'population': 60000}]},
            {'state': 'Ohio',
             'shortname': 'OH',
             'info': {
                  'governor': 'John Kasich'
             },
             'counties': [{'name': 'Summit', 'population': 1234},
                          {'name': 'Cuyahoga', 'population': 1337}]},
           ]
    json_normalize(data)
    

    Output:

    Displays a DataFrame

    Reproducing Error:

    from pandas.io.json import json_normalize
    data = [{'state': 'Florida',
             'shortname': 'FL',
             'info': {
                  'governor': 'Rick Scott'
             },
             'counties': [{'name': 'Dade', 'population': 12345},
                         {'name': 'Broward', 'population': 40000},
                         {'name': 'Palm Beach', 'population': 60000}]},
            {'state': 'Ohio',
             'shortname': 'OH',
             'info': {
                  'governor': 'John Kasich'
             },
             'counties': [{'name': 'Summit', 'population': 1234},
                          {'name': 'Cuyahoga', 'population': 1337}]},
           1  # added an integer to the list
           ]
    result = json_normalize(data)
    

    Error:

    AttributeError: 'int' object has no attribute 'items'
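
The try/except around json.loads does not catch these records, because a line holding a bare number (or null) is valid JSON and parses without raising. A minimal sketch using only the standard library; the sample lines are made up for illustration:

```python
import json

# Hypothetical sample lines: one tweet-like object, one bare number,
# one JSON null, and one line that is not valid JSON at all.
sample_lines = ['{"text": "hello", "id": 1}', '1', 'null', '{broken']

parsed = []
for line in sample_lines:
    try:
        parsed.append(json.loads(line))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        continue

# Only the malformed line was skipped; the int and None got through.
parsed_types = [type(item).__name__ for item in parsed]
print(parsed_types)
```

So the exception handler only removes malformed lines; anything that parses to a non-dict value still ends up in the list.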
    

    How to prune tweet_data (not needed if you follow the update below):

    Before normalising, run the following:

    tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
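
To see what is being dropped before discarding it, one option is to tally the element types first. A small sketch; the tweet_data list here is a made-up stand-in for the list built from the file:

```python
from collections import Counter

# Hypothetical stand-in for the parsed lines from the tweets file.
tweet_data = [{"text": "a", "id": 1}, 1, {"text": "b", "id": 2}, None, 7]

# Tally element types to see how many entries are not dictionaries.
type_counts = Counter(type(item).__name__ for item in tweet_data)
print(type_counts)

# Keep only the dictionaries, as in the list comprehension above.
tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
print(len(tweet_data))
```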
    

    Update (for the for loop):

    for line in tweets_file:
        try:
            tweet = json.loads(line)
            # keep the record only if it parsed to a dictionary
            if isinstance(tweet, dict):
                tweet_data.append(tweet)
        except ValueError:  # line is not valid JSON
            continue
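
The loop above can also be wrapped in a function that reports how many lines were skipped, which helps judge how much of the file is being lost. This is only a sketch: load_tweets is a hypothetical helper name, and the in-memory sample stands in for the real tweets file.

```python
import io
import json


def load_tweets(lines):
    """Return (tweets, skipped) from an iterable of JSON lines.

    `load_tweets` is a hypothetical helper name, not part of any library.
    """
    tweets = []
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:  # malformed JSON, including blank lines
            skipped += 1
            continue
        if isinstance(record, dict):
            tweets.append(record)
        else:
            skipped += 1  # valid JSON, but not an object (e.g. 1 or null)
    return tweets, skipped


# Made-up sample standing in for the real tweets file.
sample_file = io.StringIO(
    '{"text": "hi", "id": 1}\n1\n\n{oops\n{"text": "yo", "id": 2}\n'
)
tweets, skipped = load_tweets(sample_file)
print(len(tweets), skipped)
```

With the real data, the open file object can be passed in place of sample_file, and the returned list of dictionaries can then be handed to json_normalize.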