Tags: json, apache-spark, pyspark, apache-spark-sql, rdd

Corrupted record from JSON file in PySpark due to False as an entry


I have JSON data, stored in a Python dict, that looks like this:

test= {'kpiData': [{'date': '2020-06-03 10:05',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:10',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:15',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:20',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0}
]}

I want to convert it to a DataFrame, like this:

rdd = sc.parallelize([test])
jsonDF = spark.read.json(rdd)

This results in a corrupted record. From my understanding, the reason is that True and False are not valid JSON literals (JSON uses lowercase true and false), so I would need to transform these entries before calling spark.read.json() (to true or "True"). test is a dict and rdd is a pyspark.rdd.RDD object. For a DataFrame the transformation is pretty straightforward, but I didn't find a solution for these objects.
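
Inspecting the result shows only Spark's fallback column for records it could not parse (assuming the default PERMISSIVE parse mode):

jsonDF.printSchema()
# root
#  |-- _corrupt_record: string (nullable = true)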


Solution

  • spark.read.json expects an RDD of JSON strings, not an RDD of Python dictionaries. When handed non-string elements, it stringifies them with str(), so the dict arrives as its Python repr (single quotes, True/False), which is not valid JSON. If you convert the dictionary to a JSON string first, you can read it into a DataFrame:

    import json

    # json.dumps serializes the dict to valid JSON
    # (double quotes, lowercase true/false)
    df = spark.read.json(sc.parallelize([json.dumps(test)]))
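
    As a quick sanity check, the inferred schema should come back as structs (a sketch; this assumes Spark's usual JSON schema inference, which sorts field names alphabetically):

    df.printSchema()
    # root
    #  |-- kpiData: array (nullable = true)
    #  |    |-- element: struct (containsNull = true)
    #  |    |    |-- a: string (nullable = true)
    #  |    |    |-- b: double (nullable = true)
    #  |    |    |-- c: boolean (nullable = true)
    #  |    |    |-- date: string (nullable = true)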
    

    Another possible way is to create the DataFrame directly from the dictionary with spark.createDataFrame:

    df = spark.createDataFrame([test])
    

    which will infer a different schema, with kpiData as an array of maps instead of an array of structs.
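
    With this route the elements of kpiData are maps, so values are accessed by key rather than by field name (a sketch, assuming the map-based schema described above):

    from pyspark.sql import functions as F

    # explode yields one row per map in the kpiData array;
    # individual entries are then pulled out by key
    df.select(F.explode("kpiData").alias("kpi")) \
      .select(F.col("kpi")["date"].alias("date"), F.col("kpi")["b"].alias("b")) \
      .show()

    Note that PySpark also emits a deprecation warning here, since inferring a schema from a plain dict is deprecated in favor of pyspark.sql.Row.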