I have an RDD whose first two elements are shown below:
import json
dataset_json = sc.textFile("data/my_data.json")
dataset = dataset_json.map(lambda x: json.loads(x))
dataset.persist()
dataset.take(2)
Output:
[{'movie': 'movie_name1',
'release_date': '2011-01-11T10:26:12Z',
'actor': 'actor_name1'},
{'movie': 'movie_name2',
'release_date': '2010-04-08T04:14:23Z',
'actor': 'actor_name2'}]
I want to isolate the release date values, but the code below raises
AttributeError: 'dict' object has no attribute 'lookup'
dataset2 = dataset.filter(lambda line: line.lookup('release_date'))
dataset2.first()
If I try to list the keys using the following code:
attributes = dataset.filter(lambda x: x.keys())
attributes.take(2)
it again returns the full dataset instead of only the keys:
[{'movie': 'movie_name1',
'release_date': '2011-01-11T10:26:12Z',
'actor': 'actor_name1'},
{'movie': 'movie_name2',
'release_date': '2010-04-08T04:14:23Z',
'actor': 'actor_name2'}]
Can anybody explain to me why the above code doesn’t work, and how I can isolate the release_date? (The final aim of this exercise is to find the earliest release date.) Thanks!
To get all the values for the key 'release_date', just use a map:
dataset.map(lambda x: x.get('release_date')).take(2)
# Out:
# ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']
Pass a default value, as in get('release_date', 'some_default_value'), to handle lines with a missing 'release_date'.
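As a rough illustration of what the default does, here is a plain-Python sketch (no Spark needed). The second record and the 'unknown' placeholder are made up for the example; the lambda passed to map would see each dictionary the same way.

```python
# Made-up records for illustration; the second one has no 'release_date'.
records = [
    {'movie': 'movie_name1', 'release_date': '2011-01-11T10:26:12Z'},
    {'movie': 'movie_name3'},
]

# Without a default, a missing key yields None.
without_default = [r.get('release_date') for r in records]

# With a default, a missing key yields the placeholder instead.
with_default = [r.get('release_date', 'unknown') for r in records]

print(without_default)  # ['2011-01-11T10:26:12Z', None]
print(with_default)     # ['2011-01-11T10:26:12Z', 'unknown']
```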
To sort:
dataset.takeOrdered(2, key = lambda x: x.get('release_date'))
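(Note that the dates are compared as strings. That gives chronological order here only because every value uses the same ISO 8601 format in UTC.)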
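If you would rather not rely on string ordering, one option is to parse each value before comparing. The sketch below is plain Python and assumes every timestamp looks like the two in the question; parse_release_date is a helper name invented for this example.

```python
from datetime import datetime

def parse_release_date(value):
    """Parse a timestamp such as '2011-01-11T10:26:12Z' into a datetime."""
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')

# The two values shown in the question.
release_dates = ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']

# Compare by the parsed datetime rather than by the raw string.
earliest = min(release_dates, key=parse_release_date)
print(earliest)  # 2010-04-08T04:14:23Z
```

On the RDD itself, the same helper should work as the key function, for example dataset.takeOrdered(1, key=lambda x: parse_release_date(x['release_date'])), provided every record has a 'release_date' in that format.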
lookup() is a function that can be applied to an RDD, but in this case the RDD contains dictionaries rather than key-value pairs. One way to use lookup would be to flatten the RDD into a key-value pair RDD:
# lookup() returns a plain Python list, so slice it instead of calling take()
dataset.flatMap(lambda x: x.items()).lookup('release_date')[:2]
# Out:
# ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']
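In your example, you were trying to apply lookup to line, which is a dictionary and doesn't have a lookup method. As for the second attempt, filter only decides whether to keep each element: a non-empty set of keys is truthy, so every dictionary is kept unchanged. Use map when you want to transform the elements.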
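Here is a plain-Python sketch of the difference between filter and map, using the two records from the question trimmed to two keys. Python's built-in filter and map stand in for the RDD methods here; they treat the lambda's result in the same way (as a keep/drop test versus as the new element).

```python
records = [
    {'movie': 'movie_name1', 'release_date': '2011-01-11T10:26:12Z'},
    {'movie': 'movie_name2', 'release_date': '2010-04-08T04:14:23Z'},
]

# filter uses the result only as a truth test: non-empty keys are truthy,
# so every record is kept and comes back unchanged.
filtered = list(filter(lambda x: x.keys(), records))

# map replaces each record with the result of the function.
mapped = list(map(lambda x: list(x.keys()), records))

print(filtered == records)  # True
print(mapped)  # [['movie', 'release_date'], ['movie', 'release_date']]
```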