Tags: json, pyspark, struct, microsoft-fabric

Reading JSON files and getting the correct data type: inferSchema is giving me problems and setting it to false isn't an option


I'm building a data repository and getting data from a source application's API as JSON. These JSON payloads are quite complex, and their structure might change from call to call, since after the initial load I'm only getting the changed records daily. Using a lakehouse architecture, the goal is to have a single parquet table for each entity, to which the changed records (deltas) brought in daily are appended.

The problem is: sometimes, fields that are usually numbers with decimals (let's say amount paid, for the sake of this argument) come in the JSON with the value 0, and spark.read.json() infers that field's data type as LongType. This makes appending new data to the existing table an issue, because different files make Spark infer the DataType of the same field differently depending on that field's values (LongType when the value is 0, DoubleType when it is 0.8, for example).
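To illustrate why the inference flips, here's a minimal sketch using Python's standard json module, which types values by their literal form the same way Spark's JSON reader does (the field name amount_paid is made up for the example):

```python
import json

# 0 parses as an integer literal; 0.8 parses as a float literal.
# Spark's JSON reader maps these to LongType and DoubleType respectively,
# so the same field gets a different inferred type on different days.
day_one = json.loads('{"amount_paid": 0}')
day_two = json.loads('{"amount_paid": 0.8}')

print(type(day_one["amount_paid"]))  # <class 'int'>   -> inferred as LongType
print(type(day_two["amount_paid"]))  # <class 'float'> -> inferred as DoubleType
```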

Faced with this issue, the knee-jerk reaction was: let's just normalize the schema. All the fields that are inferred as long are cast as double, and there is no more issue with appending new data to the parquet table.

Well, the more experienced among the readers might have laughed at this possible fix, as it comes with a very obvious (in hindsight) problem: sure, the schema of the existing table and the dataframe (the JSON with the day's new data read as a Spark df) I'm trying to append now match, but trying to put the value 0 into a field that is expecting a double also blows everything up.

Does anyone have a possible solution for this?


Solution

  • After searching (suffering) a bit more and getting nowhere, I decided to rethink the whole process and came up with a solution: create the dataframe from the API data with the correct schema from the start. If the JSON is saved in the lakehouse with the intended (correct) schema (by saving the df created from the response with the intended schema "hard-coded" as a parameter), it will be read into a dataframe further along the ETL process with the correct schema just by inferring it, avoiding the problem in the first place.

    Now, the correct way of doing this would be to write the schema from scratch, but since I'm lazy and easily distracted, I did it the "janky" way: create the dataframe with inferSchema = true, copy the text output from print(df.schema), change what needed to be changed, create the variable fullSchema with that output hard-coded, and pass it as the schema when creating the df in the first place:

    if response.status_code == 200 and len(response.json()) != 0:
        fileToLoad = (
            spark.read.option("multiline", "true")
            .schema(fullSchema)
            .json(sc.parallelize([json.dumps(response.json())]))
        )