I have several different datasets on a data lake (in JSON format). This is landing data from an ingestion process.
I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to ensure the datatypes are correct.
I want to load a predetermined schema for each dataset in PySpark, so that I can use the notebook for more than one dataset (parameterised).
I want to be able to create a "schema file" on the lake, load it into a schema object in PySpark, and then load the dataframe from the files on the lake using that schema object.
#dataSchema = LoadFromFile(varSchema)
df = spark.read.load(varLanding, format='json', schema=dataSchema)
display(df.limit(5))
Can this predetermined schema file be JSON?
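Yes, it can. A StructType serialises to JSON via its json() method and can be rebuilt with StructType.fromJson(), so the schema file can simply contain that JSON representation. One thing to watch: spark.read.json() parses a file into a DataFrame of rows, not into a schema object, so the schema file has to be read as plain text instead. A sketch below, assuming the schema file holds the output of StructType.json() and using placeholder paths: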
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("SchemaLoading").getOrCreate()

# Read the schema file as one text document, then rebuild the StructType
schema_file_path = "path_to_schema_file.json"
schema_json = spark.read.text(schema_file_path, wholetext=True).first()[0]
dataSchema = StructType.fromJson(json.loads(schema_json))

# Load the landing data with the predetermined schema applied
landing_path = "path_to_json_files"
df = spark.read.load(landing_path, format="json", schema=dataSchema)
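To produce the schema file in the first place, one option is to infer the schema once from a sample of the landing data and persist its JSON form to the lake. A minimal one-off sketch, assuming the same placeholder paths as above and that a file-system utility such as dbutils (Databricks) or mssparkutils (Synapse/Fabric) is available in your notebook environment:

# One-off: infer the schema from the landing data and save its JSON
# representation to the lake for later parameterised runs
sample_df = spark.read.json(landing_path)
dbutils.fs.put(schema_file_path, sample_df.schema.json(), True)  # True = overwrite

You can then hand-edit the saved JSON to correct any types the inference got wrong (e.g. a string that should be a timestamp), and every subsequent run of the notebook will enforce that schema.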