apache-spark, pyspark

Get data type from a StructType column


I’m reading an Avro file from S3 and trying to write it out as a Delta file. The DataFrame has the following schema:

|--test: struct
   |--test2: struct
      |--test3: struct

When I run:

print(df.schema['test'].dataType)

I get the correct output, but when I run

print(df.schema['test.test2'].dataType)

I get the following error:

'No StructField named test.test2'

I need to get the struct schema because sometimes Spark infers that some struct columns are strings when they are empty. What I’m trying to do is verify whether a column's type is a StringType or a StructType. However, as I said before, I can’t get the data type of a nested structure.
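
For example, this is the kind of check I want to run (a sketch; it works for a top-level column like test but fails for nested ones):

from pyspark.sql.types import StringType, StructType

field_type = df.schema['test'].dataType     # fine for a top-level column
if isinstance(field_type, StructType):
    ...  # it is a real struct
elif isinstance(field_type, StringType):
    ...  # Spark inferred an empty struct as a string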

My question is: is it possible to get the data type of a nested column without iterating over the schema? If not, what is the best way to iterate?


Solution

  • To access a nested StructType as an object, use the schema attribute on a selection of the target column.

    Sample (with some hypothetical data):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType
    
    spark = SparkSession.builder.getOrCreate()
    
    # Hypothetical sample row matching the schema below.
    data = [(('James', '', 'Smith'), 'OH', 'M')]
    
    schema = StructType([
        StructField('name', StructType([
            StructField('firstname', StringType(), True),
            StructField('middlename', StringType(), True),
            StructField('lastname', StringType(), True)
        ])),
        StructField('state', StringType(), True),
        StructField('gender', StringType(), True)
    ])
    
    df = spark.createDataFrame(data=data, schema=schema)
    # df.printSchema()
    print(df.select('name.firstname').schema)
    

    StructType([StructField('firstname', StringType(), True)])
    

    To get the inner data type of a concrete StructField, use the following access pattern:

    print(df.select('name.firstname').schema['firstname'].dataType)
    

    StringType()
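
    To answer the "how to iterate" part of the question: if you would rather not select the column first, a small helper that walks the schema is enough. A minimal sketch (the name field_at is mine, not part of the PySpark API):

    from pyspark.sql.types import StringType, StructField, StructType
    
    def field_at(schema: StructType, path: str) -> StructField:
        # Descend through nested StructTypes, one dotted path segment at a time.
        field = None
        current = schema
        for name in path.split('.'):
            field = current[name]      # StructType supports lookup by field name
            current = field.dataType   # step into the nested type (if any)
        return field
    
    # Works for the schema above, and enables the StringType-vs-StructType check:
    dt = field_at(df.schema, 'name.firstname').dataType
    print(isinstance(dt, StringType))   # True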