apache-spark pyspark rdd

How do I replace a character in an RDD using pyspark?


I have an RDD that looks like this:

[['M5126', 'M5416', 'Z4789', 'Z01810', 'S060X6D', 'S9032XA', 'S96912A', 'S72002A', 'S61411A', 'W268XXA', 'Y9269', 'Z23'], ['S62639B', 'M25512', 'M1712', 'M25612', 'M62512', 'S39012D', 'S39012A', 'M25511', 'Z98890', '11', '29', 'Z5189']]

How do I replace the commas with tildes so that my RDD looks like this:

['M5126~ M5416~ Z4789~ Z01810~ S060X6D~ S9032XA~ S96912A~ S72002A~ S61411A~ W268XXA~ Y9269~ Z23~ S62639B~ M25512~ M1712~ M25612~ M62512~ S39012D~ S39012A~ M25511~ Z98890~ 11~ 29~ Z5189']

I tried:

rdd = rdd.map(lambda row: "~".join([str(cd) for cd in row])).reduce(lambda x,y: "~".join([x,y]))

But this returns one long string with no spaces after the tildes, rather than a list.


Solution

  • Just add a space to the joining string and wrap the result in a list:

    result = [rdd.map(lambda row: "~ ".join([str(cd) for cd in row])).reduce(lambda x,y: "~ ".join([x,y]))]
    

    which gives

    ['M5126~ M5416~ Z4789~ Z01810~ S060X6D~ S9032XA~ S96912A~ S72002A~ S61411A~ W268XXA~ Y9269~ Z23~ S62639B~ M25512~ M1712~ M25612~ M62512~ S39012D~ S39012A~ M25511~ Z98890~ 11~ 29~ Z5189']
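
As an alternative sketch: PySpark's documentation describes `reduce` as taking a commutative and associative operator, and joining strings is not commutative, so it may be safer not to depend on `reduce` for the ordering. One option, assuming the codes fit comfortably in driver memory, is to flatten the rows with `flatMap`, bring them back with `collect()`, and join on the driver. The names `join_codes`, `join_codes_rdd`, and `demo` are hypothetical helpers introduced here for illustration; they are not part of PySpark.

```python
from itertools import chain


def join_codes(rows, sep="~ "):
    """Flatten rows of codes and join them into one string (hypothetical helper)."""
    return sep.join(str(code) for code in chain.from_iterable(rows))


def join_codes_rdd(rdd, sep="~ "):
    """Return a one-element list holding every code in the RDD, joined by sep."""
    # flatMap unpacks each row into individual codes on the executors;
    # collect() then returns them to the driver as a Python list.
    codes = rdd.flatMap(lambda row: row).collect()
    # join_codes expects rows, so wrap the flat list as a single row.
    return [join_codes([codes], sep)]


def demo():
    # Assumes a local Spark installation; imported here so the helpers
    # above can be used without one.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([["M5126", "M5416"], ["S62639B", 11]])
    print(join_codes_rdd(rdd))
```

Calling `demo()` runs the Spark version on a small sample. Because `collect()` pulls every code to the driver, this approach suits small results like the one in the question and is not a good fit for a very large RDD.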