pyspark, azure-synapse-analytics, pyspark-schema

Azure Synapse PySpark - Load Schema from a Schema Definition File


I have several different datasets on a datalake (in JSON format). This is landing data from an ingestion process.

I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to ensure the data types are correct.

I want to load a predetermined schema for each dataset in PySpark, so that I can use the same notebook for more than one dataset (parameterised).

I want to be able to create a "Schema File" on the lake, then load it into a Schema object in PySpark and load the dataframe from the files on the lake using that Schema Object.

# dataSchema = LoadFromFile(varSchema)   # pseudocode: load the schema from the schema file
df = spark.read.load(varLanding, format='json', schema=dataSchema)
display(df.limit(5))

Solution

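  • Can this predetermined schema file be JSON? If so, store the schema in the JSON form that df.schema.json() produces and parse it into a StructType. Note that spark.read.json() returns a DataFrame, not a schema, so its result cannot be passed as schema=.

    import json
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType

    spark = SparkSession.builder.appName("SchemaLoading").getOrCreate()

    schema_file_path = "path_to_schema_file.json"

    # Read the whole schema file as one string, then parse it into a StructType.
    # Assumes the file holds the output of df.schema.json().
    schema_text = spark.read.text(schema_file_path, wholetext=True).first()[0]
    dataSchema = StructType.fromJson(json.loads(schema_text))

    landing_path = "path_to_json_files"

    # Apply the loaded schema when reading the landing files
    df = spark.read.load(landing_path, format="json", schema=dataSchema)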
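
If no DataFrame exists yet to export a schema from, the schema file can also be written by hand or generated. The sketch below shows one possible way to produce a file in the JSON layout that StructType.fromJson() reads (the same layout df.schema.json() emits). It uses only the Python standard library; the helper build_schema_definition, the column names, and the file name are hypothetical and only illustrate the layout.

```python
import json
import os
import tempfile


def build_schema_definition(columns):
    """Build a dict in the JSON layout used by Spark's StructType.

    `columns` is a list of (name, spark_type_name, nullable) tuples.
    This helper and its argument format are hypothetical.
    """
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": type_name, "nullable": nullable, "metadata": {}}
            for name, type_name, nullable in columns
        ],
    }


# Hypothetical columns for one landing dataset.
definition = build_schema_definition([
    ("customer_id", "long", False),
    ("customer_name", "string", True),
    ("signup_date", "date", True),
])

# Written to a local temp file here; in Synapse the file would live on the lake.
schema_path = os.path.join(tempfile.mkdtemp(), "customer_schema.json")
with open(schema_path, "w", encoding="utf-8") as f:
    json.dump(definition, f, indent=2)

# Reading it back gives the dict that StructType.fromJson() would be passed.
with open(schema_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Keeping one such file per dataset means the notebook only needs the schema path and the landing path as parameters.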