python-3.x, pyspark, rdd

Pyspark: AttributeError: 'dict' object has no attribute 'lookup'


I have an RDD whose first two elements are shown below:

    import json

    dataset_json = sc.textFile("data/my_data.json")
    dataset = dataset_json.map(lambda x: json.loads(x))
    dataset.persist()
    dataset.take(2)

Output:

[{'movie': 'movie_name1',
  'release_date': '2011-01-11T10:26:12Z',
  'actor': 'actor_name1'},
 {'movie': 'movie_name2',
  'release_date': '2010-04-08T04:14:23Z',
  'actor': 'actor_name2'}]

I want to isolate the values for release_date, but the code further down raises this error:

AttributeError: 'dict' object has no attribute 'lookup'

    dataset2 = dataset.filter(lambda line: line.lookup('release_date')) 
    dataset2.first() 

If I try to list the keys with the following code instead:

    attributes = dataset.filter(lambda x: x.keys())
    attributes.take(2)

the output is again the full dataset rather than only the keys:

[{'movie': 'movie_name1',
  'release_date': '2011-01-11T10:26:12Z',
  'actor': 'actor_name1'},
 {'movie': 'movie_name2',
  'release_date': '2010-04-08T04:14:23Z',
  'actor': 'actor_name2'}]

Can anybody explain to me why the code above doesn't work, and how I can isolate the release_date? (The final aim of this exercise is to find the earliest release date.) Thanks!


Solution

  • To get all the values for the key 'release_date', just use a map:

    dataset.map(lambda x: x.get('release_date')).take(2)
    # Out:
    # ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']
    

    get returns None when the key is absent. To use a different default for records with a missing 'release_date', pass it as the second argument: get('release_date', 'some_default_value').

    To get the two records with the earliest release dates, use takeOrdered:

    dataset.takeOrdered(2, key = lambda x: x.get('release_date'))
    

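    (Note that the dates are compared as strings. This gives chronological order here only because every value uses the same ISO 8601 UTC format. A record without a 'release_date' yields None, which Python 3 cannot compare with a string.)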
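
Since the final aim is the earliest release date, one option is to parse the timestamps into datetime objects before comparing them. The following is a minimal sketch, assuming every timestamp follows the format shown in the question. The helper names parse_release_date and earliest_release and the third sample record are hypothetical. It uses a plain Python list so that it runs without a Spark session; the closing comment shows the RDD calls that would presumably play the same role.

```python
from datetime import datetime

# Sample records shaped like the question's output. The third record is a
# made-up example of a record with no release date.
records = [
    {'movie': 'movie_name1', 'release_date': '2011-01-11T10:26:12Z', 'actor': 'actor_name1'},
    {'movie': 'movie_name2', 'release_date': '2010-04-08T04:14:23Z', 'actor': 'actor_name2'},
    {'movie': 'movie_name3', 'actor': 'actor_name3'},
]

def parse_release_date(record):
    """Return the release date as a datetime, or None if the key is missing."""
    raw = record.get('release_date')
    if raw is None:
        return None
    return datetime.strptime(raw, '%Y-%m-%dT%H:%M:%SZ')

def earliest_release(recs):
    """Return the earliest release date among the records that have one."""
    dates = [d for d in map(parse_release_date, recs) if d is not None]
    return min(dates)  # raises ValueError if no record has a date

earliest = earliest_release(records)
print(earliest)  # 2010-04-08 04:14:23

# Presumed RDD equivalent (not executed here):
# dataset.map(parse_release_date).filter(lambda d: d is not None).min()
```

Filtering out None before calling min avoids the TypeError that Python 3 raises when None is compared with another value.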

    lookup() is a method of key-value (pair) RDDs. In this case the RDD contains dictionaries, not key-value pairs. One way to use lookup is to flatten the RDD into a pair RDD with flatMap. Note that lookup is an action that returns a plain Python list, so slice the result instead of calling take on it:

    dataset.flatMap(lambda x: x.items()).lookup('release_date')[:2]
    # Out:
    # ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']
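
To see what the flattening step produces, the same idea can be imitated with plain Python. This is only a sketch that runs without Spark: the lookup function below is a hypothetical stand-in written for illustration, not the PySpark method itself, and the sample records are copied from the question.

```python
from itertools import chain

records = [
    {'movie': 'movie_name1', 'release_date': '2011-01-11T10:26:12Z', 'actor': 'actor_name1'},
    {'movie': 'movie_name2', 'release_date': '2010-04-08T04:14:23Z', 'actor': 'actor_name2'},
]

# Like flatMap(lambda x: x.items()): each dictionary contributes several
# (key, value) pairs to one flat collection.
pairs = list(chain.from_iterable(r.items() for r in records))

def lookup(kv_pairs, key):
    """Hypothetical stand-in for lookup on a pair RDD: all values stored under key."""
    return [v for k, v in kv_pairs if k == key]

release_dates = lookup(pairs, 'release_date')
print(pairs[0])        # ('movie', 'movie_name1')
print(release_dates)   # ['2011-01-11T10:26:12Z', '2010-04-08T04:14:23Z']
```

The result is an ordinary list, which is why it is sliced rather than passed to take.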
    

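    In your example, you were applying lookup to line, which is a dictionary and has no lookup method. Your second attempt returns the full dataset because filter never changes elements; it only keeps or drops them. It keeps every record for which the function returns a truthy value, and a non-empty x.keys() is always truthy. To turn each record into its keys, use map instead of filter.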
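
The behaviour of the second attempt in the question can be reproduced with Python's built-in filter and map, which follow the same keep-if-truthy and transform-each-element semantics as the RDD methods of the same name. A minimal sketch on a plain list, using the sample records from the question:

```python
records = [
    {'movie': 'movie_name1', 'release_date': '2011-01-11T10:26:12Z', 'actor': 'actor_name1'},
    {'movie': 'movie_name2', 'release_date': '2010-04-08T04:14:23Z', 'actor': 'actor_name2'},
]

# filter keeps an element when the function's result is truthy.
# A non-empty keys() view is truthy, so every record is kept unchanged.
filtered = list(filter(lambda x: x.keys(), records))

# map replaces each element with the function's result.
mapped = list(map(lambda x: list(x.keys()), records))

print(filtered == records)  # True
print(mapped[0])            # ['movie', 'release_date', 'actor']
```

On the RDD, the corresponding call would presumably be dataset.map(lambda x: list(x.keys())).take(2).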