Tags: dictionary, apache-spark, pyspark, apache-spark-sql

How do I convert a Python dictionary with a list of values per key into a PySpark DataFrame, so that each value in the list gets its own row?


I want to convert my_dict into a PySpark DataFrame. The starting dictionary looks like:

my_dict = {'z': 'some_string', 'y':'some_other_string', 'a': [1,2,3], 'b': [4,5,6]}

This is easy enough to do in pandas with pd.DataFrame.from_dict, but I guess we're not in Kansas anymore! I haven't found a straightforward way to accomplish this in PySpark.
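
For reference, here is a minimal pandas sketch of what I mean (pandas broadcasts the scalar values across the rows created by the list values):

import pandas as pd

my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}

# the scalar values ('z', 'y') are broadcast across the three rows
# produced by the list values ('a', 'b')
pdf = pd.DataFrame.from_dict(my_dict)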

The resulting PySpark DataFrame should be represented as:

'a'  'b'  'z'            'y'
 1    4   'some_string'  'some_other_string'
 2    5   'some_string'  'some_other_string'
 3    6   'some_string'  'some_other_string'

Solution

  • To pass data to spark.createDataFrame, every column needs the same number of values, so the scalar entries have to be repeated to match the longest list. We can extract the data from the dictionary in the following way:

    n_rows = max(len(vals) if isinstance(vals, list) else 1 for vals in my_dict.values())
    # n_rows = 3
    
    col_names = list(my_dict.keys())
    
    # repeat each scalar value n_rows times so every column has the
    # same length (checking isinstance rather than len avoids
    # accidentally zipping over a string whose length equals n_rows)
    data = [vals if isinstance(vals, list) else [vals] * n_rows for vals in my_dict.values()]
    # [['some_string', 'some_string', 'some_string'],
    #  ['some_other_string', 'some_other_string', 'some_other_string'],
    #  [1, 2, 3],
    #  [4, 5, 6]]
    
    # transpose the columns into rows
    pyspark_data = list(zip(*data))
    # [('some_string', 'some_other_string', 1, 4),
    #  ('some_string', 'some_other_string', 2, 5),
    #  ('some_string', 'some_other_string', 3, 6)]
    

    Then we can create the PySpark DataFrame and display it with df.show():

    # `spark` is the active SparkSession (available by default in
    # pyspark shells and notebooks)
    df = spark.createDataFrame(
        pyspark_data,
        col_names
    )
    df.show()
    
    +-----------+-----------------+---+---+
    |          z|                y|  a|  b|
    +-----------+-----------------+---+---+
    |some_string|some_other_string|  1|  4|
    |some_string|some_other_string|  2|  5|
    |some_string|some_other_string|  3|  6|
    +-----------+-----------------+---+---+
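
    If you need this more than once, the same steps fold into a small helper. Below is a minimal self-contained sketch, assuming a local SparkSession and equal-length lists (the dict_to_spark_df name is just illustrative):

    from pyspark.sql import DataFrame, SparkSession

    def dict_to_spark_df(spark: SparkSession, d: dict) -> DataFrame:
        """One row per list element; scalar values repeated on every row."""
        n_rows = max((len(v) for v in d.values() if isinstance(v, list)), default=1)
        cols = [v if isinstance(v, list) else [v] * n_rows for v in d.values()]
        return spark.createDataFrame(list(zip(*cols)), list(d.keys()))

    spark = SparkSession.builder.master("local[*]").appName("dict-to-df").getOrCreate()
    my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}
    dict_to_spark_df(spark, my_dict).show()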