This is in PySpark, where r_parsed is an RDD:
r_parsed = r_parsed.map(lambda x: ([k for k in x.keys()][:3]))
x = r_parsed.collect()[666]
print(x)
['is_quote_status', 'place', 'in_reply_to_user_id']
But then...
r_parsed = r_parsed.map(lambda x: ([k for k in x.keys()][1]))
x = r_parsed.collect()[666]
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 120.0 failed 1 times, most recent failure: Lost task 1.0 in stage 120.0 (TID 241, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/filipe/anaconda3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/home/filipe/anaconda3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/filipe/anaconda3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 5, in 
IndexError: list index out of range
What sorcery is this? Why can I map a list slice, but not a list index?
This is because slicing clamps out-of-range bounds: if the list is shorter than the slice, you simply get back whatever elements exist.
For example:
my_list = [1]
print(my_list[:3])
#[1]
However, indexing is not as forgiving:
print(my_list[1])
IndexErrorTraceback (most recent call last)
<ipython-input-2-f6bea31a1a9e> in <module>()
----> 1 print(my_list[1])
IndexError: list index out of range
Somewhere in your RDD, you have a row with fewer than 2 keys.
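To see this for yourself, you can filter for the short rows and inspect them. Here is a minimal sketch on plain dicts standing in for the RDD rows (the sample data is made up); in PySpark the same predicate goes into `rdd.filter(...)` followed by `take(5)`:

```python
# Made-up rows standing in for the parsed tweets in the RDD.
rows = [
    {"is_quote_status": False, "place": None, "in_reply_to_user_id": 1},
    {"is_quote_status": True},  # only one key: indexing [1] would fail here
]

# Find the rows that would break list(x.keys())[1].
short_rows = [x for x in rows if len(x.keys()) <= 1]
print(short_rows)  # the offending rows
```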
You may want to first filter:
r_parsed = r_parsed.filter(lambda x: len(x.keys()) > 1).map(lambda x: list(x.keys())[1])
Or build the logic into your map function:
r_parsed = r_parsed.map(lambda x: list(x.keys())[1] if len(x.keys()) > 1 else None)
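(Note the `list(...)` wrapper: on Python 3, `dict.keys()` returns a view that does not support indexing.) The two approaches agree on well-formed rows and differ only on the short ones; a minimal sketch with made-up rows, using list comprehensions in place of the RDD operations:

```python
# Made-up sample rows standing in for the RDD contents.
rows = [
    {"is_quote_status": False, "place": None, "in_reply_to_user_id": 1},
    {"is_quote_status": True},  # short row: no second key
]

# Option 1: drop short rows first (mirrors filter(...).map(...)).
filtered = [list(x.keys())[1] for x in rows if len(x.keys()) > 1]
print(filtered)   # the short row is gone

# Option 2: keep every row, substituting None for the short ones.
defaulted = [list(x.keys())[1] if len(x.keys()) > 1 else None for x in rows]
print(defaulted)  # same length as the input
```

Option 1 shrinks the dataset; option 2 preserves row count, which matters if you later need to line these values up with other per-row data.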