I have several different datasets on a data lake (in JSON format). This is landing data from an ingestion process.
I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to ensure the datatypes are correct.
I want to load a predetermined schema for each dataset in PySpark, so that I can use the notebook for more than one dataset (parameterised).
I want to be able to create a "schema file" on the lake, load it into a schema object in PySpark, and then load the dataframe from the files on the lake using that schema object.
#dataSchema = LoadFromFile(varSchema)
df = spark.read.load(varLanding, format='json', schema=dataSchema)
display(df.limit(5))
Can this predetermined schema file be JSON?
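Yes, it can. A StructType serialises to JSON via its json() method and can be rebuilt with StructType.fromJson(), so the schema file can simply contain that JSON representation. One thing to watch: spark.read.json() parses a file into a DataFrame of rows, not into a schema object, so the schema file has to be read as plain text instead. A sketch below, assuming the schema file holds the output of StructType.json() and using placeholder paths: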
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("SchemaLoading").getOrCreate()

# Read the schema file as one text document, then rebuild the StructType
schema_file_path = "path_to_schema_file.json"
schema_json = spark.read.text(schema_file_path, wholetext=True).first()[0]
dataSchema = StructType.fromJson(json.loads(schema_json))

# Load the landing data with the predetermined schema applied
landing_path = "path_to_json_files"
df = spark.read.load(landing_path, format="json", schema=dataSchema)
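To produce the schema file in the first place, one option is to infer the schema once from a sample of the landing data and persist its JSON form to the lake. A minimal one-off sketch, assuming the same placeholder paths as above and that a file-system utility such as dbutils (Databricks) or mssparkutils (Synapse/Fabric) is available in your notebook environment:

# One-off: infer the schema from the landing data and save its JSON
# representation to the lake for later parameterised runs
sample_df = spark.read.json(landing_path)
dbutils.fs.put(schema_file_path, sample_df.schema.json(), True)  # True = overwrite

You can then hand-edit the saved JSON to correct any types the inference got wrong (e.g. a string that should be a timestamp), and every subsequent run of the notebook will enforce that schema.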