I have a PySpark DataFrame with an inferred schema that looks like the one below. How would I define this schema explicitly in PySpark?
root
|-- active: string (nullable = true)
|-- activeText: string (nullable = true)
|-- addOns: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- addOnID: string (nullable = true)
| | |-- amount: string (nullable = true)
| | |-- category: string (nullable = true)
| | |-- code: string (nullable = true)
| | |-- creditTo: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- productID: string (nullable = true)
| | |-- quantity: string (nullable = true)
| | |-- subscriptionID: string (nullable = true)
| | |-- taxable: string (nullable = true)
|-- addedBy: string (nullable = true)
I got this far, but I wasn't sure how to deal with the array:
schema = StructType(
    [
        StructField("active", IntegerType(), True),
        StructField("activeText", StringType(), True),
        ...
        StructField("addedBy", IntegerType(), True),
    ]
)
Thanks!
You can check the ArrayType documentation for examples. For your case (note that `active` and `addedBy` are strings in your inferred schema, not integers):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("active", StringType(), True),
    StructField("activeText", StringType(), True),
    StructField("addOns", ArrayType(StructType([
        StructField("addOnID", StringType(), True),
        StructField("amount", StringType(), True),
        StructField("category", StringType(), True),
        StructField("code", StringType(), True),
        StructField("creditTo", StringType(), True),
        StructField("description", StringType(), True),
        StructField("productID", StringType(), True),
        StructField("quantity", StringType(), True),
        StructField("subscriptionID", StringType(), True),
        StructField("taxable", StringType(), True)
    ]), True)),
    StructField("addedBy", StringType(), True)
])