I have almost 1 gb file storing almost .2 mln tweets. And, the huge size of file obviously carries some errors. The errors are shown as
AttributeError: 'int' object has no attribute 'items'
. This occurs when I try to run this code.
raw_data_path = input("Enter the path for raw data file: ")
tweet_data_path = raw_data_path
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
try:
tweet = json.loads(line)
tweet_data.append(tweet)
except:
continue
tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet,
dict)]
from pandas.io.json import json_normalize
tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
"created_at", "coordinates",
"user.location", "id"]]
Can a solution be found where those lines where such error occurs can be skipped and continue for the rest of the lines.
The issue here is not with lines in data but with tweet_data itself. If you check your tweet_data, you will find one more elements which are of 'int' datatype (assuming your tweet_data is a list of dictionaries as it only expects "dict or list of dicts").
You may want to check your tweet data to remove values other that dictionaries.
I was able to reproduce with below example for json_normalize document:
Working Example:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {
'governor': 'Rick Scott'
},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {
'governor': 'John Kasich'
},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]},
]
json_normalize(data)
Output:
Displays datarame
Reproducing Error:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {
'governor': 'Rick Scott'
},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {
'governor': 'John Kasich'
},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]},
1 # *Added an integer to the list*
]
result = json_normalize(data)
Error:
AttributeError: 'int' object has no attribute 'items'
How to prune "tweet_data": Not needed, if you follow update below
Before normalising, run below:
tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
Update: (for foor loop)
for line in tweets_file:
try:
tweet = json.loads(line)
if isinstance(tweet, dict):
tweet_data.append(tweet)
except:
continue