Tags: json, apache-spark, pyspark, apache-spark-sql, rdd

Strings getting converted to null when writing JSON representation of RDD


I am trying to write an RDD whose records are structured like

(int , ListofList , ListofListofList)

Something like this:

(49807360, [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , [111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , 111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123],[111206019,'ABC','XYZ:RDC' , 'RDC' , 123]])

When I print this in RDD form, I see the data correctly. When I use the built-in library to write it in JSON format, I get null values in place of the strings.

{"user":49807360,"history":[[111206019,null,null,null,123], [111206019,null,null,null,123]],"collection":...}

The line of code I am using to serialize the RDD to JSON is

rdd.toDF().toJSON().saveAsTextFile(ouput_file_path)

I have also tried

rdd.toDF().write.json(ouput_file_path,"overwrite","gzip")

The code above was run on Spark 2.0.0.


Solution

  • This happens because you use a DataFrame as an intermediate step. Spark SQL doesn't support heterogeneous arrays, so values that don't match the inferred type (array<bigint>) are replaced by NULL.
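
Since the nulls come from the DataFrame step, one possible workaround is to skip it and serialize each record with Python's standard `json` module, which keeps mixed-type lists intact. The following is only a minimal sketch: the helper `to_json_line` and the `FIELD_NAMES` tuple are hypothetical names, and the field names are assumed from the JSON output shown in the question.

```python
import json

# Assumed field names, copied from the JSON output in the question.
FIELD_NAMES = ("user", "history", "collection")


def to_json_line(record, field_names=FIELD_NAMES):
    """Serialize one RDD record (a tuple) to a single JSON line."""
    if len(record) != len(field_names):
        # zip() would silently drop extra values, so fail loudly instead.
        raise ValueError("record length does not match field_names")
    # json.dumps has no single-element-type restriction on lists.
    return json.dumps(dict(zip(field_names, record)))


sample = (
    49807360,
    [[111206019, "ABC", "XYZ:RDC", "RDC", 123]],
    [[111206019, "ABC", "XYZ:RDC", "RDC", 123]],
)
print(to_json_line(sample))

# With Spark (not executed here), the same function could be mapped over the RDD:
# rdd.map(to_json_line).saveAsTextFile(ouput_file_path)
```

This sidesteps Spark SQL entirely; the rest of this answer covers keeping the DataFrame route.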

    If you really want to go this way and support heterogeneous structures, either use tuples, which are mapped to Spark SQL structs, or don't rely on schema inference and provide the desired schema explicitly:

    schema = ...  # type: StructType
    spark.createDataFrame(rdd, schema)
    

    with schema (JSON representation) similar to:

    {'fields': [{'metadata': {}, 'name': '_1', 'nullable': True, 'type': 'long'},
      {'metadata': {},
       'name': '_2',
       'nullable': True,
       'type': {'containsNull': True,
        'elementType': {'fields': [{'metadata': {},
           'name': '_1',
           'nullable': True,
           'type': 'long'},
          {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
         'type': 'struct'},
        'type': 'array'}},
      {'metadata': {},
       'name': '_3',
       'nullable': True,
       'type': {'fields': [{'metadata': {},
          'name': '_1',
          'nullable': True,
          'type': {'fields': [{'metadata': {},
             'name': '_1',
             'nullable': True,
             'type': 'long'},
            {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
            {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
            {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
            {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
           'type': 'struct'}},
         {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'long'},
         {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
         {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
         {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'string'},
         {'metadata': {}, 'name': '_6', 'nullable': True, 'type': 'long'}],
        'type': 'struct'}},
      {'metadata': {},
       'name': '_4',
       'nullable': True,
       'type': {'containsNull': True,
        'elementType': {'fields': [{'metadata': {},
           'name': '_1',
           'nullable': True,
           'type': 'long'},
          {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
          {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
         'type': 'struct'},
        'type': 'array'}}],
     'type': 'struct'}
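
The tuple-based option mentioned above can be sketched in plain Python. The helper `lists_to_structs` is a hypothetical name, and its rule is an assumption chosen for data shaped like the question's: a list whose items are all lists is treated as a real array and kept as a list, while any other list is treated as a record and turned into a tuple. A homogeneous list of scalars would also become a tuple under this rule, so it may need adjusting for other data.

```python
def lists_to_structs(value):
    """Recursively turn record-like lists into tuples.

    Assumed rule: a list made up only of lists stays a list (an array);
    every other list or tuple becomes a tuple (a struct-like record).
    """
    if not isinstance(value, (list, tuple)):
        return value  # scalars pass through unchanged
    items = [lists_to_structs(v) for v in value]
    if isinstance(value, list) and all(isinstance(v, list) for v in value):
        return items  # array of records (or an empty array)
    return tuple(items)  # mixed-type record


row = (49807360, [[111206019, "ABC", "XYZ:RDC", "RDC", 123]])
print(lists_to_structs(row))

# With Spark (not executed here), the converted records could then be passed on:
# rdd.map(lists_to_structs).toDF()
```

Whether inference alone then yields the schema you want depends on the data, so passing an explicit schema to `spark.createDataFrame` remains the more predictable choice.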