Search code examples
pythonapache-sparkpysparkschema

pyspark schema mismatch issue


I am trying to load a .csv file into spark using this code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Window


spark = SparkSession.builder.appName('Demo').master('local').getOrCreate()

pathData = '/home/data/departuredelays.csv'

schema = StructType([
            StructField('date', StringType()),
            StructField('delay', IntegerType()),
            StructField('distance', IntegerType()),
            StructField('origin', StringType()),
            StructField('destination', StringType()),
            ])

flightsDelatDf = (spark
                  .read
                  .format('csv')
                  .option('path', pathData)
                  .option('header', True)
                  .option("schema", schema)
                  .load()
                  )

When I check the schema, i see that column delay and distance are shown as type string whereas in the schema, I have defined them as integers

flightsDelatDf.printSchema()

root
 |-- date: string (nullable = true)
 |-- delay: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)

But if I read the file using .schema(schema) instead of using .option('schema', schema) to specify schema :

flightsDelatDf = (spark
                  .read
                  .format('csv')
                  .option('path', pathData)
                  .option('header', True)
                  .schema(schema)
                  .load()
                  )

I see that the column data types are aligned with what I have specified.

flightsDelatDf.printSchema()

root
 |-- date: string (nullable = true)
 |-- delay: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)

Does anyone why in the first type, the data types are not aligned with the schema defined, whereas they are in the second type? Thanks in advance.


Solution

  • Incorrect Method (option("schema", schema)): Since .option() is not intended to directly analyze and apply a schema object, the StructType schema is not applied in this way. Because .option() designed for straightforward key-value settings, PySpark by default assumes the schema from the CSV file.

    The correct technique is schema(schema); it is especially made to take in a StructType schema and apply it to the DataFrame while it is being read from the CSV file. This guarantees that the columns in the DataFrame have the same data types as those listed in the schema.

    Hope this answers your question