Tags: python, pandas, dataframe, pyspark, spark-koalas

Convert list of dict into DataFrame with Koalas


I've tried to convert a list of dicts into a Databricks Koalas DataFrame, but I keep getting this error message:

ArrowInvalid: cannot mix list and non-list, non-null values

pandas handles this without problems (with pd.DataFrame(list)), but because of company restrictions I must use PySpark/Koalas. I've also tried converting the list into a dictionary, but the error persists.

An example of the list:

[{'A': None,
  'B': None,
  'C': None,
  'D': None,
  'E': [],
  ...},
{'A': data,
  'B': data,
  'C': data,
  'D': data,
  'E': None,
  ...}
]

And the dict is like:

{'A': [None, data, [], [], data],
 'B': [None, data, None, [], None],
 'C': [None, data, None, [], None],
 'D': [None, data, None, [], None],
 'E': [[], None, data, [], None]}

Is it possible to get a DataFrame from this? Thanks


Solution

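  • You can create a Spark DataFrame from your data as-is, without modifying it, by passing an explicit schema to spark.createDataFrame(). The schema here assumes every column holds a list of integers or None.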

    from pyspark.sql import types as T

    # `spark` is the active SparkSession; data_list is the list of dicts.
    # Every column is declared as a nullable array of integers.
    sdf = spark.createDataFrame(
        data_list,
        T.StructType([
            T.StructField('A', T.ArrayType(T.IntegerType()), True),
            T.StructField('B', T.ArrayType(T.IntegerType()), True),
            T.StructField('C', T.ArrayType(T.IntegerType()), True),
            T.StructField('D', T.ArrayType(T.IntegerType()), True),
            T.StructField('E', T.ArrayType(T.IntegerType()), True),
        ])
    )
    

    The Spark DataFrame can then be converted to a Koalas DataFrame with to_koalas() (available once databricks.koalas has been imported).

    >>> sdf.to_koalas()
               A          B          C          D     E
    0       None       None       None       None    []
    1  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  None

    Alternatively, I was able to create a Koalas DataFrame directly, without building a Spark DataFrame first, by replacing the empty lists [] in your data with None.

    data_list = [
        {
            'A': None,
            'B': None,
            'C': None,
            'D': None,
            'E': None,
        },
        {
            'A': [1, 2, 3],
            'B': [1, 2, 3],
            'C': [1, 2, 3],
            'D': [1, 2, 3],
            'E': None,
        },
    ]
    
    >>> import databricks.koalas as ks
    >>> ks.DataFrame(data_list)
               A          B          C          D     E
    0       None       None       None       None  None
    1  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  [1, 2, 3]  None
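
If the real list is too long to edit by hand, the same empty-list-to-None replacement can be done in plain Python before calling ks.DataFrame(). This is only a minimal sketch: the helper name empty_lists_to_none and the sample records are made up for illustration, and it assumes that empty lists are the only values that need changing (it will not help if a column mixes non-empty lists with scalar values).

```python
def empty_lists_to_none(records):
    """Return a new list of dicts in which every empty list value is None."""
    return [
        {
            key: (None if isinstance(value, list) and len(value) == 0 else value)
            for key, value in record.items()
        }
        for record in records
    ]


# Hypothetical sample shaped like the data in the question.
raw = [
    {'A': None, 'B': None, 'E': []},
    {'A': [1, 2, 3], 'B': [1, 2, 3], 'E': None},
]

# The input is left untouched; a cleaned copy is returned.
cleaned = empty_lists_to_none(raw)
print(cleaned)
```

The cleaned list can then be passed on as ks.DataFrame(cleaned), in the same way as the hand-edited data_list above.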