Tags: apache-spark, pyspark, struct, apache-spark-sql, pyspark-schema

How to create dataframe with struct column in PySpark without specifying a schema?


I am learning PySpark, and it is convenient to be able to quickly create example dataframes to try out the PySpark API.

The following code (where spark is a Spark session):

rows = [{'id': 1, 'data': {'x': 'mplah', 'y': [10, 20, 30]}},
        {'id': 2, 'data': {'x': 'mplah2', 'y': [100, 200, 300]}}]
df = spark.createDataFrame(rows)
df.printSchema()

gives a map type for the data column (and does not interpret the array correctly):

root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: long (nullable = true)

I need a struct instead. I can force a struct by providing a schema:

import pyspark.sql.types as T

rows = [{'id': 1, 'data': {'x': 'mplah', 'y': [10, 20, 30]}},
        {'id': 2, 'data': {'x': 'mplah2', 'y': [100, 200, 300]}}]
schema = T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('data', T.StructType([
        T.StructField('x', T.StringType()),
        T.StructField('y', T.ArrayType(T.LongType())),
    ])),
])
df = spark.createDataFrame(rows, schema=schema)
df.printSchema()

That indeed gives:

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

But this is too much typing.

Is there any other quick way to create the dataframe so that the data column is a struct without specifying the schema?


Solution

  • When creating an example dataframe, you can use Python tuples, which are converted into Spark structs. However, this way you cannot specify the struct field names; they default to _1, _2, and so on.

    df = spark.createDataFrame(
        [(1, ('mplah', [10,20,30])),
         (2, ('mplah2', [100,200,300]))],
        ['id', 'data']
    )
    df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- data: struct (nullable = true)
    #  |    |-- _1: string (nullable = true)
    #  |    |-- _2: array (nullable = true)
    #  |    |    |-- element: long (containsNull = true)
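
    With the tuple approach you can also rename the auto-generated _1/_2 fields afterwards. This is a sketch that relies on struct casts matching fields by position:

    from pyspark.sql import functions as F
    # Sketch: a struct cast matches fields positionally, so it can
    # rename the auto-generated _1/_2 fields in place.
    df = df.withColumn('data', F.col('data').cast('struct<x:string,y:array<bigint>>'))
    # data is now struct<x: string, y: array<bigint>>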
    

    Alternatively, you can supply the schema as a DDL string when creating the dataframe, so the struct fields get proper names from the start:

    df = spark.createDataFrame(
        [(1, ('mplah', [10,20,30])),
         (2, ('mplah2', [100,200,300]))],
        'id: bigint, data: struct<x:string,y:array<bigint>>'
    )
    df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- data: struct (nullable = true)
    #  |    |-- x: string (nullable = true)
    #  |    |-- y: array (nullable = true)
    #  |    |    |-- element: long (containsNull = true)
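
    Another schema-free option, sketched here on the assumption that PySpark's type inference treats Row objects as structs (it does in recent versions): the struct field names are taken from the Row itself.

    from pyspark.sql import Row
    # Sketch: nested Row objects are inferred as structs,
    # with field names taken from the Row.
    df = spark.createDataFrame(
        [(1, Row(x='mplah', y=[10, 20, 30])),
         (2, Row(x='mplah2', y=[100, 200, 300]))],
        ['id', 'data']
    )
    df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- data: struct (nullable = true)
    #  |    |-- x: string (nullable = true)
    #  |    |-- y: array (nullable = true)
    #  |    |    |-- element: long (containsNull = true)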
    

    However, I often prefer the approach using F.struct. No detailed schema has to be provided, and the struct field names are taken from the column names:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [(1, 'mplah', [10,20,30]),
         (2, 'mplah2', [100,200,300])],
        ['id', 'x', 'y']
    )
    df = df.select('id', F.struct('x', 'y').alias('data'))
    
    df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- data: struct (nullable = false)
    #  |    |-- x: string (nullable = true)
    #  |    |-- y: array (nullable = true)
    #  |    |    |-- element: long (containsNull = true)
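
    Once data is a struct, its fields can be selected with dot notation (a quick usage check; F is the functions module imported above):

    df.select('id', 'data.x', F.col('data.y')[0].alias('y_first')).show()
    # rows: (1, 'mplah', 10) and (2, 'mplah2', 100)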