I'm trying to convert a list of dicts into a Databricks Koalas DataFrame, but I keep getting this error message:
ArrowInvalid: cannot mix list and non-list, non-null values
Pandas works perfectly (with pd.DataFrame(list)), but because of company restrictions I must use PySpark/Koalas. I've also tried converting the list of dicts into a dict of lists, and the error persists.
An example of the list:
[{'A': None,
'B': None,
'C': None,
'D': None,
'E': [],
...},
{'A': data,
'B': data,
'C': data,
'D': data,
'E': None,
...}
]
And the dict is like:
{'A': [None, data, [], [], data],
'B': [None, data, None, [], None],
'C': [None, data, None, [], None],
'D': [None, data, None, [], None],
'E': [[], None, data, [], None]}
Is it possible to get a DataFrame from this? Thanks
You can create a Spark DataFrame from your data, without any data manipulation, by passing an explicit schema to spark.createDataFrame().
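from pyspark.sql import types as T

# Explicit schema: every column is a nullable array of integers
# (this assumes the values are lists of integers, as in the output below).
sdf = spark.createDataFrame(
    data_list,
    T.StructType([
        T.StructField('A', T.ArrayType(T.IntegerType()), True),
        T.StructField('B', T.ArrayType(T.IntegerType()), True),
        T.StructField('C', T.ArrayType(T.IntegerType()), True),
        T.StructField('D', T.ArrayType(T.IntegerType()), True),
        T.StructField('E', T.ArrayType(T.IntegerType()), True),
    ])
)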
This can then be converted to a Koalas DataFrame using to_koalas(), which becomes available on Spark DataFrames once databricks.koalas has been imported.
>>> sdf.to_koalas()
A B C D E
0 None None None None []
1 [1, 2, 3] [1, 2, 3] [1, 2, 3] [1, 2, 3] None
Additionally, I was able to create a Koalas DataFrame directly, without going through Spark, by modifying your data so that the empty lists [] are replaced with None.
data_list = [
    {
        'A': None,
        'B': None,
        'C': None,
        'D': None,
        'E': None,
    },
    {
        'A': [1, 2, 3],
        'B': [1, 2, 3],
        'C': [1, 2, 3],
        'D': [1, 2, 3],
        'E': None,
    }
]
>>> import databricks.koalas as ks
>>> ks.DataFrame(data_list)
A B C D E
0 None None None None None
1 [1, 2, 3] [1, 2, 3] [1, 2, 3] [1, 2, 3] None
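If the real list is too long to edit by hand, the replacement of empty lists with None can be done programmatically. The following is only a minimal sketch: the helper name empty_lists_to_none is hypothetical (it is not part of Koalas or PySpark), and it assumes the empty lists sit directly in the dict values rather than nested deeper.

```python
def empty_lists_to_none(records):
    """Return a new list of dicts in which every empty list value is None."""
    return [
        {
            key: (None if isinstance(value, list) and len(value) == 0 else value)
            for key, value in record.items()
        }
        for record in records
    ]


# Small sample in the same shape as the data in the question.
data_list = [
    {'A': None, 'E': []},
    {'A': [1, 2, 3], 'E': None},
]

cleaned = empty_lists_to_none(data_list)
print(cleaned)
# [{'A': None, 'E': None}, {'A': [1, 2, 3], 'E': None}]
```

The helper builds new dicts instead of changing the originals, so data_list is left untouched. The cleaned list can then be passed to ks.DataFrame(cleaned) in the same way as the hand-edited data above.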