Tags: dictionary, apache-spark, pyspark, apache-spark-sql

How do I convert a Python dictionary with a list of values per key into a PySpark DataFrame, so that each value in the list gets its own row?


I want to convert my_dict into a PySpark DataFrame. The starting dictionary looks like:

my_dict = {'z': 'some_string', 'y':'some_other_string', 'a': [1,2,3], 'b': [4,5,6]}

This is easy enough to do in pandas with pd.DataFrame.from_dict, but I guess we're not in Kansas anymore! I haven't found a straightforward way to accomplish this in PySpark.
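
For reference, here is a minimal pandas sketch of what I mean (pandas broadcasts the scalar values across the rows created by the list values):

import pandas as pd

my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}

# the scalar values ('z', 'y') are broadcast across the three rows
# produced by the list values ('a', 'b')
pdf = pd.DataFrame.from_dict(my_dict)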

The resulting PySpark DataFrame should be represented as:

'a'  'b'  'z'            'y'
 1    4   'some_string'  'some_other_string'
 2    5   'some_string'  'some_other_string'
 3    6   'some_string'  'some_other_string'

Solution

  • To pass data to spark.createDataFrame, every column needs the same number of values, so the scalar entries have to be repeated to match the longest list. We can extract the data from the dictionary in the following way:

    n_rows = max(len(vals) if isinstance(vals, list) else 1 for vals in my_dict.values())
    # n_rows = 3
    
    col_names = list(my_dict.keys())
    
    # repeat each scalar value n_rows times so every column has the
    # same length (checking isinstance rather than len avoids
    # accidentally zipping over a string whose length equals n_rows)
    data = [vals if isinstance(vals, list) else [vals] * n_rows for vals in my_dict.values()]
    # [['some_string', 'some_string', 'some_string'],
    #  ['some_other_string', 'some_other_string', 'some_other_string'],
    #  [1, 2, 3],
    #  [4, 5, 6]]
    
    # transpose the columns into rows
    pyspark_data = list(zip(*data))
    # [('some_string', 'some_other_string', 1, 4),
    #  ('some_string', 'some_other_string', 2, 5),
    #  ('some_string', 'some_other_string', 3, 6)]
    

    Then we can create the PySpark DataFrame and display it with df.show():

    # `spark` is the active SparkSession (available by default in
    # pyspark shells and notebooks)
    df = spark.createDataFrame(
        pyspark_data,
        col_names
    )
    df.show()
    
    +-----------+-----------------+---+---+
    |          z|                y|  a|  b|
    +-----------+-----------------+---+---+
    |some_string|some_other_string|  1|  4|
    |some_string|some_other_string|  2|  5|
    |some_string|some_other_string|  3|  6|
    +-----------+-----------------+---+---+
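
    If you need this more than once, the same steps fold into a small helper. Below is a minimal self-contained sketch, assuming a local SparkSession and equal-length lists (the dict_to_spark_df name is just illustrative):

    from pyspark.sql import DataFrame, SparkSession

    def dict_to_spark_df(spark: SparkSession, d: dict) -> DataFrame:
        """One row per list element; scalar values repeated on every row."""
        n_rows = max((len(v) for v in d.values() if isinstance(v, list)), default=1)
        cols = [v if isinstance(v, list) else [v] * n_rows for v in d.values()]
        return spark.createDataFrame(list(zip(*cols)), list(d.keys()))

    spark = SparkSession.builder.master("local[*]").appName("dict-to-df").getOrCreate()
    my_dict = {'z': 'some_string', 'y': 'some_other_string', 'a': [1, 2, 3], 'b': [4, 5, 6]}
    dict_to_spark_df(spark, my_dict).show()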