I am learning PySpark, and it is convenient to be able to quickly create example dataframes to try out PySpark API functionality. The following code (where spark is a Spark session):
df = [
    {'id': 1, 'data': {'x': 'mplah', 'y': [10, 20, 30]}},
    {'id': 2, 'data': {'x': 'mplah2', 'y': [100, 200, 300]}},
]
df = spark.createDataFrame(df)
df.printSchema()
gives a map (and does not interpret the array correctly):
root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: long (nullable = true)
I needed a struct. I can force a struct if I give a schema:
import pyspark.sql.types as T

df = [
    {'id': 1, 'data': {'x': 'mplah', 'y': [10, 20, 30]}},
    {'id': 2, 'data': {'x': 'mplah2', 'y': [100, 200, 300]}},
]
schema = T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('data', T.StructType([
        T.StructField('x', T.StringType()),
        T.StructField('y', T.ArrayType(T.LongType())),
    ])),
])
df = spark.createDataFrame(df, schema=schema)
df.printSchema()
That indeed gives:
root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
But this is too much typing.
Is there any other quick way to create the dataframe so that the data column is a struct without specifying the schema?
When creating an example dataframe, you can use Python tuples, which are converted into Spark structs. This way, however, you cannot specify the struct field names:
df = spark.createDataFrame(
    [(1, ('mplah', [10, 20, 30])),
     (2, ('mplah2', [100, 200, 300]))],
    ['id', 'data']
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- _1: string (nullable = true)
#  |    |-- _2: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
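If you only need to fix the auto-generated _1/_2 names afterwards, one option (a minimal sketch, not part of the approaches below) is to cast the struct column to a struct type with the desired field names, since struct-to-struct casts match fields by position:
from pyspark.sql import functions as F

# Positional cast: _1 becomes x, _2 becomes y; the data itself is unchanged.
df = df.withColumn('data', F.col('data').cast('struct<x:string,y:array<bigint>>'))
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)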
If you want the proper field names from the start, you can keep the tuples and pass a DDL-formatted schema string:
df = spark.createDataFrame(
    [(1, ('mplah', [10, 20, 30])),
     (2, ('mplah2', [100, 200, 300]))],
    'id: bigint, data: struct<x:string,y:array<bigint>>'
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
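Another quick variant (a sketch of my own, assuming Spark 3+, where Row keeps the keyword order) is to use nested Row objects: they carry field names with them, so Spark infers a named struct with no schema at all:
from pyspark.sql import Row

# Nested Rows are inferred as structs, with field names taken from the kwargs.
df = spark.createDataFrame([
    Row(id=1, data=Row(x='mplah', y=[10, 20, 30])),
    Row(id=2, data=Row(x='mplah2', y=[100, 200, 300])),
])
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)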
However, I often prefer a method using the struct function. This way, no detailed schema is provided, and the struct field names are taken from the column names:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 'mplah', [10, 20, 30]),
     (2, 'mplah2', [100, 200, 300])],
    ['id', 'x', 'y']
)
df = df.select('id', F.struct('x', 'y').alias('data'))
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = false)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
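Whichever way you build it, a quick sanity check (illustrative; the first_y alias is just a name made up here) is to address the struct fields with dot notation:
# Struct fields support dot paths; array elements support indexing.
df.select('data.x', F.col('data.y')[0].alias('first_y')).show()
# +------+-------+
# |     x|first_y|
# +------+-------+
# | mplah|     10|
# |mplah2|    100|
# +------+-------+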