Tags: apache-spark, pyspark, rdd, partitioning

RDD pyspark partitionBy - TypeError: 'int' object is not subscriptable


list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]

rdd1 = sc.parallelize(list_1)
newpairRDD = rdd1.partitionBy(2, lambda k: int(k[0]))
print("Partitions structure: {}".format(newpairRDD.glom().collect()))

I want to partition by key.

I am getting

TypeError: 'int' object is not subscriptable

What am I doing wrong?


Solution

  • The partitioning function provided to partitionBy operates only on the key of each entry of the RDD, i.e. the first element of each pair. Here the keys are the integers 6 and 5, so lambda k: int(k[0]) tries to index into an int with k[0], which raises TypeError: 'int' object is not subscriptable.

    If you simply want to partition by key, your lambda function should be an identity operation (a full runnable sketch follows below), e.g.

    newpairRDD = rdd1.partitionBy(2, lambda x: x)
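
    Putting it together, here is a minimal self-contained sketch. It assumes no SparkContext exists yet; in the PySpark shell or a notebook, sc is already provided and the getOrCreate line can be dropped. Because partitionBy assigns each pair to partition partitionFunc(key) % numPartitions, the identity function sends the key-6 pairs to partition 0 (6 % 2 == 0) and the key-5 pairs to partition 1 (5 % 2 == 1).

    from pyspark import SparkContext

    # Assumption: running as a standalone script; in a shell/notebook
    # `sc` already exists and this line can be skipped.
    sc = SparkContext.getOrCreate()

    list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]
    rdd1 = sc.parallelize(list_1)

    # partitionBy passes only the key (6 or 5) to the partitioning
    # function and places each pair in partitionFunc(key) % numPartitions.
    newpairRDD = rdd1.partitionBy(2, lambda x: x)

    # glom() gathers the elements of each partition into a list, making
    # the resulting partition layout visible.
    print("Partitions structure: {}".format(newpairRDD.glom().collect()))
    # Expected: one partition holds both key-6 pairs, the other both key-5 pairs.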