Tags: python, apache-spark, jupyter, azure-notebooks

Jupyter Notebooks Spark RDD split function - remove brackets


I am taking some "columns" from a previous RDD and then want to split the second element. The split result ends up wrapped in brackets. How can I put the values on one level (flatten them, i.e. remove the brackets)? I have spent about 10 hours looking for a solution. It needs to be done without using a DataFrame. Thanks.

separatedRDD = extractedRDD.map(lambda y: (y[0], y[1].split(' ', 1), y[2], y[3]))

separatedRDD.take(2) # get output

[(u'2014-03-15:10:10:20',
  [u'Sorrento', u'F41L'],  ############### those are brackets I am talking about...
  u'8cc3b47e-bd01-4482-b500-28f2342679af',
  u'33.6894754264'),
 (u'2014-03-15:10:10:20',
  [u'MeeToo', u'1.0'],
  u'ef8c7564-0a1a-4650-a655-c8bbd5f8f943',
  u'37.4321088904')] 

Solution

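  • y[1] is a list (that is what split returns), so you just need to flatten it by pulling its elements out by index. This assumes the split always yields exactly two elements, and note that the example leaves out the last field, y[3]: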

    separatedRDD.map(lambda y: (y[0], y[1][0], y[1][1], y[2])).collect()
    

    Result:

    [('2014-03-15:10:10:20',
      'Sorrento',
      'F41L',
      '8cc3b47e-bd01-4482-b500-28f2342679af'),
     ('2014-03-15:10:10:20',
      'MeeToo',
      '1.0',
      'ef8c7564-0a1a-4650-a655-c8bbd5f8f943')]
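
If the split might not always produce exactly two pieces, or if every trailing field should be kept, the same idea can be written with tuple concatenation instead of fixed indexes. The following is only a sketch: the helper name flatten_split_record is hypothetical, and it runs on a plain Python list holding the two rows from the question so that it works without a Spark session.

```python
def flatten_split_record(y):
    """Splice the list in y[1] into the surrounding tuple.

    Hypothetical helper: y is one record of separatedRDD, shaped like
    (timestamp, [part, part, ...], id, value).
    """
    # (y[0],) is a one-element tuple; tuple(y[1]) unpacks the split list,
    # however many items it has; y[2:] keeps every remaining field.
    return (y[0],) + tuple(y[1]) + tuple(y[2:])


# The two records shown in the question, as plain Python data.
sample = [
    ('2014-03-15:10:10:20', ['Sorrento', 'F41L'],
     '8cc3b47e-bd01-4482-b500-28f2342679af', '33.6894754264'),
    ('2014-03-15:10:10:20', ['MeeToo', '1.0'],
     'ef8c7564-0a1a-4650-a655-c8bbd5f8f943', '37.4321088904'),
]

flattened = [flatten_split_record(row) for row in sample]
for row in flattened:
    print(row)
```

Because the helper takes one record and returns one tuple, it should also be usable as separatedRDD.map(flatten_split_record), assuming the records have the shape shown in the question.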