I want to replace the first element of each list in an RDD.
First I convert the RDD of strings to an RDD of lists with:
ff = rdd.map(lambda x: x.split(","))
simpleRDD = ff.map(lambda x: x)
print("Partitions structure: {}".format(simpleRDD.glom().collect()))
Partitions structure (example): [[['2020-05-22 12:36:12','240144','54'], ['2020-05-22 12:36:12','32456','64']]]
I want to replace the first element of each list in the RDD, i.e. '2020-05-22 12:36:12' in both rows, with a different value.
I have tried replaceRDD = simpleRDD.map(lambda a: ("new" if a[0] else "new"))
but this replaces all elements with "new".
How can I achieve something like this:
Partitions structure (example): [[['myvalue','240144','54'], ['myvalue','32456','64']]]
Suppose you have an rdd like so:
words = spark.sparkContext.parallelize([["scala", "java", "hadoop"],[ "spark", "akka","spark"], ["hadoop", "pyspark", "kafka"]])
And you want to convert the first element to something else. As an example, I will replace the first word of each list with its length:
def len_replace(string):
    return len(string)
words.map(lambda x: ([len_replace(x[0])] + x[1:])).take(3)
# Output
# [[5, 'java', 'hadoop'], [5, 'akka', 'spark'], [6, 'pyspark', 'kafka']]
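You replace the first element by passing x[0] to the function and wrapping the result in a list. You also need the rest of the elements, so you concatenate x[1:] to that list, which means "all elements from index 1 to the end". The attempt in the question returns a single string per row instead of a list, which is why every row collapses to "new".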
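Applied to the data in the question, the same idea could look like the sketch below. It is written in plain Python so that it runs without a Spark session: the built-in map() stands in for simpleRDD.map(), and replace_first is a hypothetical helper name, not part of PySpark.

```python
# Plain-Python sketch; no Spark session is needed to run it.
def replace_first(row, new_value="myvalue"):
    """Return a new list with the first element replaced by new_value."""
    return [new_value] + row[1:]

# Sample rows copied from the partition structure in the question.
rows = [
    ["2020-05-22 12:36:12", "240144", "54"],
    ["2020-05-22 12:36:12", "32456", "64"],
]

# Built-in map() stands in for simpleRDD.map(replace_first) here.
replaced = list(map(replace_first, rows))
print(replaced)
# [['myvalue', '240144', '54'], ['myvalue', '32456', '64']]

# The attempt from the question returns a bare string per row, not a list.
attempt = list(map(lambda a: "new" if a[0] else "new", rows))
print(attempt)
# ['new', 'new']
```

On the real RDD the equivalent call would presumably be simpleRDD.map(lambda x: ["myvalue"] + x[1:]). Because the helper builds a new list rather than assigning to row[0], the input rows are left unchanged.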