Convert my_dict into a PySpark dataframe. The starting dictionary looks like:
my_dict = {'z': 'some_string', 'y':'some_other_string', 'a': [1,2,3], 'b': [4,5,6]}
Easy enough to do in pandas with pd.DataFrame.from_dict, but I guess this isn't Kansas anymore! I haven't found a straightforward way to accomplish this in PySpark.
The resulting PySpark DataFrame should be represented as:
 a   b   z              y
 1   4   'some_string'  'some_other_string'
 2   5   'some_string'  'some_other_string'
 3   6   'some_string'  'some_other_string'
If we want to pass data to spark.createDataFrame, we'll want all rows to be the same length. We can extract the data from your dictionary in the following way:
n_rows = max([len(vals) if isinstance(vals, list) else 1 for vals in my_dict.values()])
# n_rows = 3
col_names = list(my_dict.keys())
# broadcast scalar values to n_rows; checking isinstance (rather than len)
# avoids accidentally splitting a string whose length happens to equal n_rows
data = [vals if isinstance(vals, list) else [vals] * n_rows for vals in my_dict.values()]
# [['some_string', 'some_string', 'some_string'],
# ['some_other_string', 'some_other_string', 'some_other_string'],
# [1, 2, 3],
# [4, 5, 6]]
# transpose data
pyspark_data = list(zip(*data))
# pyspark_data
# [('some_string', 'some_other_string', 1, 4),
# ('some_string', 'some_other_string', 2, 5),
# ('some_string', 'some_other_string', 3, 6)]
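Packaged as a reusable helper (a sketch — `dict_to_rows` is a name introduced here, not part of the original), the broadcast-and-transpose steps above look like:

```python
def dict_to_rows(d):
    """Broadcast scalar values to the list length, then transpose
    into row tuples suitable for spark.createDataFrame."""
    n_rows = max(len(v) if isinstance(v, list) else 1 for v in d.values())
    cols = list(d.keys())
    # scalars become n_rows-long lists; existing lists pass through
    data = [v if isinstance(v, list) else [v] * n_rows for v in d.values()]
    return list(zip(*data)), cols


my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}
rows, cols = dict_to_rows(my_dict)
# rows[0] == ('some_string', 'some_other_string', 1, 4)
```

This keeps the extraction logic in one place, so the same dictionary shape can be converted again without repeating the comprehension.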
Then we can create the PySpark DataFrame:
df = spark.createDataFrame(
    pyspark_data,
    col_names
)
df.show()
+-----------+-----------------+---+---+
| z| y| a| b|
+-----------+-----------------+---+---+
|some_string|some_other_string| 1| 4|
|some_string|some_other_string| 2| 5|
|some_string|some_other_string| 3| 6|
+-----------+-----------------+---+---+
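As an alternative (a sketch, assuming pandas is installed and an active SparkSession named `spark` exists): pandas already broadcasts scalar values when building a DataFrame from a dict, and spark.createDataFrame accepts a pandas DataFrame directly, so the manual transpose can be skipped entirely:

```python
import pandas as pd

my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}

# pandas broadcasts the scalar strings across the 3-row index
pdf = pd.DataFrame(my_dict)

# hand the pandas DataFrame straight to Spark (requires an active SparkSession)
# df = spark.createDataFrame(pdf)
```

The trade-off is a pandas dependency on the driver; for small dictionaries like this one, that is usually fine.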