Tags: dataframe, apache-spark, pyspark, schema, parquet

Passing schema to construct DataFrame


I'm working on migrating a job. As part of this, I need to pass column datatypes when constructing DataFrames.

I was able to build a dictionary that maps each table name to the schema definition for that table. But when I pass one of the dictionary values as the DataFrame schema, Spark complains that the schema should be a StructType, not a string.

Dictionary I'm creating:

{'table1': StructType([
    StructField("column1",varchar(), True),
    StructField("column2",numeric(), True),
    StructField("column3",numeric(), True),
    StructField("column4",timestamp(), True),
    StructField("column5",timestamp(), True),
    StructField("column6",timestamp(), True)
])}

I'm aware the datatypes above may be wrong; this is just an example.

Error: expecting a Struct not a string literal for schema definition.


Solution

  • I'm not sure how you use your dictionary, but the following way of passing the schema as a dict value works well:

    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType, TimestampType)

    my_dict = {'table1': StructType([
        StructField("column1", StringType(), True),
        StructField("column2", LongType(), True),
        StructField("column3", LongType(), True),
        StructField("column4", TimestampType(), True),
        StructField("column5", TimestampType(), True),
        StructField("column6", TimestampType(), True)
    ])}
    
    df = spark.createDataFrame([], my_dict["table1"])
    
    df.printSchema()
    # root
    #  |-- column1: string (nullable = true)
    #  |-- column2: long (nullable = true)
    #  |-- column3: long (nullable = true)
    #  |-- column4: timestamp (nullable = true)
    #  |-- column5: timestamp (nullable = true)
    #  |-- column6: timestamp (nullable = true)
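
  • The error in the question ("expecting a Struct not a string literal") usually means a plain string, rather than a StructType object, ended up as the dict value. If the table definitions start out as source-system type names (varchar, numeric, ...), one way to avoid that is to translate the names into actual Spark type objects before building the dict. A minimal sketch; the TYPE_MAP contents and the build_schema helper are hypothetical, not part of the original job:

    ```python
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType, TimestampType)

    # Hypothetical mapping from source-system type names to Spark types;
    # adjust to match the actual source datatypes.
    TYPE_MAP = {
        "varchar": StringType(),
        "numeric": LongType(),
        "timestamp": TimestampType(),
    }

    def build_schema(columns):
        """Build a StructType from (column_name, type_name) pairs."""
        return StructType([
            StructField(name, TYPE_MAP[type_name], True)
            for name, type_name in columns
        ])

    schemas = {
        "table1": build_schema([
            ("column1", "varchar"),
            ("column2", "numeric"),
            ("column4", "timestamp"),
        ])
    }

    # The dict values are real StructType objects, so they can be passed
    # directly as the schema argument to spark.createDataFrame or a reader.
    print(schemas["table1"].simpleString())
    # struct<column1:string,column2:bigint,column4:timestamp>
    ```

    This keeps the table definitions as plain data while guaranteeing that whatever comes out of the dict is a StructType, not a string.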