
replace specific element of rdd in pyspark


I want to replace the first element of each rdd list.

First I convert the RDD of strings to an RDD of lists with:

ff = rdd.map(lambda x:  x.split(","))
simpleRDD = ff.map(lambda x: x) 
print("Partitions structure: {}".format(simpleRDD.glom().collect()))

Partitions structure (example): [[['2020-05-22 12:36:12','240144','54'], ['2020-05-22 12:36:12','32456','64']]]

I want to replace the first element of each rdd list, i.e. '2020-05-22 12:36:12', with a different value.

I have tried replaceRDD = simpleRDD.map(lambda a: ("new" if a[0] else "new")), but this replaces each whole list with "new".

How to achieve something like this:

Partitions structure (example): [[['myvalue','240144','54'], ['myvalue','32456','64']]]


Solution

  • Suppose you have an rdd like so:

    words = spark.sparkContext.parallelize([["scala", "java", "hadoop"],[ "spark", "akka","spark"], ["hadoop", "pyspark", "kafka"]])
    

    And suppose you want to transform the first element of each list. As an example, let's replace the first word with its length:

    def len_replace(string):
        return len(string)
    
    words.map(lambda x: ([len_replace(x[0])] + x[1:])).take(3)
    # Output
    # [[5, 'java', 'hadoop'], [5, 'akka', 'spark'], [6, 'pyspark', 'kafka']]
    

    You replace the first element by passing x[0] to the function, and since you also need the rest of the elements, you concatenate x[1:], the slice containing all elements from index 1 to the end of the list.
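    Applied to the original question, the same slice-and-concatenate pattern replaces the timestamp with 'myvalue'. A minimal sketch follows; since the lambda operates on plain Python lists, it is shown here with the built-in map so it runs without a Spark cluster, and the equivalent PySpark call is noted in a comment:

    ```python
    # Sample rows matching the question's partition structure
    rows = [['2020-05-22 12:36:12', '240144', '54'],
            ['2020-05-22 12:36:12', '32456', '64']]

    # The same function you would pass to the RDD, e.g.:
    # replaceRDD = simpleRDD.map(lambda x: ['myvalue'] + x[1:])
    replace_first = lambda x: ['myvalue'] + x[1:]

    result = list(map(replace_first, rows))
    print(result)
    # [['myvalue', '240144', '54'], ['myvalue', '32456', '64']]
    ```

    The lambda builds a new list for each row rather than mutating the input, which is the right fit for RDD transformations since RDDs are immutable.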