python apache-spark rdd

Failing to use collect() on an RDD


Does anyone know what the error is in this line of code? I spent hours searching but couldn't fix it. Thank you in advance.

labels = RDD.map(lambda (a, b): a).collect()

It fails with a syntax error.


Solution

  • If you are using Python 3, the likely cause is tuple parameter unpacking in the lambda signature, which Python 3 no longer supports (it was removed by PEP 3113). You can also check this thread.

    Let's say you have an RDD of tuples:

    RDD = spark.sparkContext.range(0, 1).map(lambda a: (a, a))
    

    The code below will fail with SyntaxError: invalid syntax

    RDD.map(lambda (a, b): a).collect()
    

    but this will work correctly:

    RDD.map(lambda a: a[0]).collect()
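
As a rough sketch of the Python 3 alternatives to `lambda (a, b): a`, the snippet below uses the built-in `map` over a plain list as a stand-in for `RDD.map`, so it runs without a Spark session. The list `pairs` and the helper `first` are illustrative names, not part of the original question; the same functions should be usable as the argument to `RDD.map`.

```python
from operator import itemgetter

# Stand-in for the (a, b) tuples held in the RDD. A plain list is an
# assumption made here so the example runs without Spark.
pairs = [(0, 0), (1, 10), (2, 20)]

# Option 1: take one argument and index into it (as in the answer above).
first_by_index = list(map(lambda t: t[0], pairs))

# Option 2: operator.itemgetter builds the same "take element 0" function.
first_by_getter = list(map(itemgetter(0), pairs))

# Option 3: a named function that unpacks the tuple in its body,
# which is still allowed in Python 3.
def first(pair):
    a, b = pair
    return a

first_by_def = list(map(first, pairs))

print(first_by_index, first_by_getter, first_by_def)
```

With a real RDD of pairs, any of these can replace the failing lambda, for example `RDD.map(first).collect()`. For key/value RDDs, `RDD.keys().collect()` may be a shorter way to get the same result.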