Tags: apache-spark, pyspark

PySpark transform multiple columns into a single column complex json


I have a flat input DataFrame with the following columns:

col_a: string
col_b: string
col_c: int
col_d: boolean

I want to transform this and load it into an output file. The output file should contain one JSON document per row. The structure of each JSON row is as follows (indented here for readability, but when stored it should be a single line):

{
  "result": [
     {
         "pair": [
           {
              "a": {col_a}
              "b": {col_b}
              ...
           },
           {
               "c": {col_c}
               "d": {col_d}
           }
         ]
     }
  ]
}

When these JSON rows are stored, they should look like:

{ "result": [{ "pair" ... }]}\n
{ "result": [{ "pair" ... }]}\n

Each JSON row object is fairly complex. What's the best way to implement the transform + load for this use case?


Solution

  • You can build the required nested structure directly from the flat columns using struct and array. Note that all elements of an array must have the same schema, so in the sketch below both entries of pair carry the fields a, b, c, and d rather than the split shown in the question.

    from pyspark.sql import functions as F

    # Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
    # Sample data matching the flat input schema from the question.
    _data = [
        ('val_a', 'val_b', 42, True),
        ('val_a', 'val_b', 43, False),
    ]
    _schema = ['col_a', 'col_b', 'col_c', 'col_d']
    df = spark.createDataFrame(_data, _schema)

    # Each element of "pair" is a struct; array elements must share a schema,
    # so both elements carry all four fields.
    pair_element = F.struct(
        F.col('col_a').alias('a'),
        F.col('col_b').alias('b'),
        F.col('col_c').alias('c'),
        F.col('col_d').alias('d')
    )
    # Wrap the two-element array in a struct named "pair",
    # then wrap that struct in the top-level "result" array.
    pair = F.struct(F.array(pair_element, pair_element).alias('pair'))
    result = F.array(pair).alias('result')

    tdf = df.select(result)
    # The json writer emits one single-line JSON document per row.
    tdf.write.format('json').save('<some-location>')
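
    With the sample data above, the saved part files should contain one line per input row, roughly like this (shown here as an illustration of the writer's output, not taken from an actual run):

        {"result":[{"pair":[{"a":"val_a","b":"val_b","c":42,"d":true},{"a":"val_a","b":"val_b","c":42,"d":true}]}]}
        {"result":[{"pair":[{"a":"val_a","b":"val_b","c":43,"d":false},{"a":"val_a","b":"val_b","c":43,"d":false}]}]}

    Field order follows the order the struct fields are declared in, and the duplicated fields in both pair elements are a consequence of the shared-schema requirement for array elements mentioned above.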