Tags: json, apache-spark, pyspark, apache-spark-sql, rdd

Corrupted record from JSON file in PySpark due to False as an entry


I have JSON data, stored in a Python dict, that looks like this:

test= {'kpiData': [{'date': '2020-06-03 10:05',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:10',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:15',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0,
   'c': True},
  {'date': '2020-06-03 10:20',
   'a': 'MINIMUMINTERVAL',
   'b': 0.0}
]}

I want to convert it to a DataFrame, like this:

rdd = sc.parallelize([test])
jsonDF = spark.read.json(rdd)

This results in a corrupted record. From my understanding, the reason is that True and False are not valid JSON literals (JSON uses lowercase true and false), so I would need to transform these entries before calling spark.read.json() (to true or "True"). test is a dict and rdd is a pyspark.rdd.RDD object. For a DataFrame the transformation is pretty straightforward, but I didn't find a solution for these objects.
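
Inspecting the result shows only Spark's fallback column for records it could not parse (assuming the default PERMISSIVE parse mode):

jsonDF.printSchema()
# root
#  |-- _corrupt_record: string (nullable = true)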


Solution

  • spark.read.json expects an RDD of JSON strings, not an RDD of Python dictionaries. When handed non-string elements, it stringifies them with str(), so the dict arrives as its Python repr (single quotes, True/False), which is not valid JSON. If you convert the dictionary to a JSON string first, you can read it into a DataFrame:

    import json

    # json.dumps serializes the dict to valid JSON
    # (double quotes, lowercase true/false)
    df = spark.read.json(sc.parallelize([json.dumps(test)]))
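
    As a quick sanity check, the inferred schema should come back as structs (a sketch; this assumes Spark's usual JSON schema inference, which sorts field names alphabetically):

    df.printSchema()
    # root
    #  |-- kpiData: array (nullable = true)
    #  |    |-- element: struct (containsNull = true)
    #  |    |    |-- a: string (nullable = true)
    #  |    |    |-- b: double (nullable = true)
    #  |    |    |-- c: boolean (nullable = true)
    #  |    |    |-- date: string (nullable = true)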
    

    Another possible way is to create the DataFrame directly from the dictionary with spark.createDataFrame:

    df = spark.createDataFrame([test])
    

    which will infer a different schema, with kpiData as an array of maps instead of an array of structs.
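
    With this route the elements of kpiData are maps, so values are accessed by key rather than by field name (a sketch, assuming the map-based schema described above):

    from pyspark.sql import functions as F

    # explode yields one row per map in the kpiData array;
    # individual entries are then pulled out by key
    df.select(F.explode("kpiData").alias("kpi")) \
      .select(F.col("kpi")["date"].alias("date"), F.col("kpi")["b"].alias("b")) \
      .show()

    Note that PySpark also emits a deprecation warning here, since inferring a schema from a plain dict is deprecated in favor of pyspark.sql.Row.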