list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]
rdd1 = sc.parallelize(list_1)
newpairRDD = rdd1.partitionBy(2,lambda k: int(k[0]))
print("Partitions structure: {}".format(newpairRDD.glom().collect()))
I want to partition by key.
I am getting
TypeError: 'int' object is not subscriptable
What am I doing wrong?
The partitioning function passed to partitionBy operates on the key of each entry of the RDD, i.e. the first element of each entry, not on the whole entry. So lambda k: int(k[0]) is being called on the integer keys themselves (6 and 5 here), and indexing an integer raises the error you encountered.
If you simply want to partition by key, your lambda function should be an identity operation, e.g.
newpairRDD = rdd1.partitionBy(2, lambda x: x)
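To see why the identity function works and the original lambda fails, here is a plain-Python sketch that runs without Spark. It assumes partitionBy places each pair in bucket partitionFunc(key) % numPartitions; simulate_partition_by is a hypothetical helper written for illustration, not part of the PySpark API.

```python
def simulate_partition_by(pairs, num_partitions, partition_func):
    """Hypothetical stand-in for RDD.partitionBy, for illustration only."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # The partitioning function receives only the key, never the whole pair.
        index = partition_func(key) % num_partitions
        buckets[index].append((key, value))
    return buckets


list_1 = [[6, [3, 8, 7]], [5, [9, 7, 3]], [6, [7, 8, 5]], [5, [6, 7, 2]]]

# Identity function: the integer key itself decides the bucket.
partitions = simulate_partition_by(list_1, 2, lambda x: x)
print(partitions)

# The lambda from the question tries to index an integer key and fails.
error_message = None
try:
    simulate_partition_by(list_1, 2, lambda k: int(k[0]))
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

In this sketch both entries with key 6 land in bucket 0 and both entries with key 5 land in bucket 1. The order of elements inside a partition on a real cluster may differ.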