apache-spark, pyspark, rdd

PySpark RDD: Manipulating Inner Array


I have a dataset (for example):

from pyspark import SparkContext

sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))

The print statement returns [(1, [2, 3, 4, 5])].

I now need to multiply every element of the inner arrays by 2 across the whole RDD. Since the data is already parallelized, I can't simply break "y.take(1)" apart on the driver to multiply [2, 3, 4, 5] by 2.

How can I isolate the inner array on my worker nodes and perform the multiplication there?


Solution

  • I think you can use map with a lambda function:

    y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
    

    Then y.take(2) returns:

    [(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
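
    As a side note, since only the values of each (key, list) pair change, mapValues would also work here and leaves the keys untouched. A minimal sketch reusing the same sample data:

    from pyspark import SparkContext

    sc = SparkContext()
    x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]

    # mapValues applies the function to the value of each pair only
    y = sc.parallelize(x).mapValues(lambda arr: [2 * t for t in arr])

    print(y.collect())
    # [(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]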