
AttributeError: 'DataFrameWriter' object has no attribute 'schema'


I would like to write a Spark DataFrame with a fixed schema. This is what I am trying:

from pyspark.sql.types import StructType, IntegerType, DateType, DoubleType, StructField


my_schema = StructType([
    StructField("seg_gs_eur_am", DoubleType()),
    StructField("seg_yq_eur_amt", DoubleType()),
    StructField("seg_awd_eur_amt", DoubleType())
])

my_path = "<some_path>"

my_spark_df.write.format("delta").schema(my_schema).save(my_path)

I receive the error:

AttributeError: 'DataFrameWriter' object has no attribute 'schema'

ChatGPT replied: "It looks like you are trying to use the .schema method on a DataFrameWriter object, but this method is not available on that object. Instead, you can specify the schema when you create the DataFrame by using the .schema method on the DataFrameReader object."

But that doesn't make sense to me, because I'm pretty sure I was able to set the schema this way years ago, and I can't find how to do it now.


Solution

  • As you have probably already guessed, you can fix the code by removing .schema(my_schema), as below:

    my_spark_df.write.format("delta").save(my_path)

    I think you are confused about where the schema applies. A schema is specified when the DataFrame is created, for example by passing it to spark.createDataFrame together with the data (a list of rows or an RDD), or to DataFrameReader.schema when reading from a source. DataFrameWriter has no option to provide a schema; it writes the DataFrame with whatever schema that DataFrame already has.

    You could take your initial DataFrame, alter its schema by casting columns as shown below, and use the resulting intermediate DataFrame for the write call:

     df.withColumn("new_column_name", df["old_column_name"].cast("new_datatype"))
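
If there are many columns, the per-column cast above can be generated from the target schema instead of written by hand. The following is only a sketch under some assumptions: cast_expressions and conform_to_schema are hypothetical helper names (they are not part of PySpark), the schema argument is a StructType like my_schema from the question, and the column names contain no backticks.

```python
# Sketch only: cast_expressions and conform_to_schema are hypothetical helpers,
# not part of PySpark. `schema` is expected to be a pyspark.sql.types.StructType
# and `df` a pyspark.sql.DataFrame.


def cast_expressions(schema):
    """Build one SQL CAST expression per field of the target schema."""
    # simpleString() returns the SQL-style type name, e.g. "double" for DoubleType.
    # Assumption: column names do not contain backticks.
    return [
        f"CAST(`{field.name}` AS {field.dataType.simpleString()}) AS `{field.name}`"
        for field in schema.fields
    ]


def conform_to_schema(df, schema):
    """Return df reduced to the schema's columns, each cast to its declared type."""
    # Spark raises an AnalysisException here if df lacks one of the schema's columns.
    return df.selectExpr(*cast_expressions(schema))
```

With the names from the question, the write would then look like conform_to_schema(my_spark_df, my_schema).write.format("delta").save(my_path). Columns that are not listed in the schema are dropped by the select, so the written table contains exactly the declared columns.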