Tags: pyspark, apache-spark-sql, spark-avro

How to assign constant values to the nested objects in pyspark?


I have a requirement to mask the data in some of the fields of a given schema. I've researched a lot but couldn't find the answer I need. This is the schema; I need to change the fields `answer_type`, `response0`, and `response3`:

 |    |-- choices: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- choice_id: long (nullable = true)
 |    |    |    |-- created_time: long (nullable = true)
 |    |    |    |-- updated_time: long (nullable = true)
 |    |    |    |-- created_by: long (nullable = true)
 |    |    |    |-- updated_by: long (nullable = true)
 |    |    |    |-- answers: struct (nullable = true)
 |    |    |    |    |-- answer_node_internal_id: long (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |    |    |    |    |-- text: map (nullable = true)
 |    |    |    |    |    |-- key: string
 |    |    |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |    |    |-- data_tag: string (nullable = true)
 |    |    |    |    |-- answer_type: string (nullable = true)
 |    |    |    |-- response: struct (nullable = true)
 |    |    |    |    |-- response0: string (nullable = true)
 |    |    |    |    |-- response1: long (nullable = true)
 |    |    |    |    |-- response2: double (nullable = true)
 |    |    |    |    |-- response3: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)

Is there a way to assign constant values to those fields in pyspark without affecting the structure above?

I've tried using explode, but I can't revert to the original schema. I don't want to create a new column, and at the same time I don't want to lose any data from the provided schema object.


Solution

  • I ran into a similar problem a few days ago. I suggest converting the struct to JSON, making the internal changes in a UDF, and then recovering the original struct afterwards.

    You should look at to_json and from_json in the documentation:

    https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json

    https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
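    A minimal sketch of that round trip, using a simplified two-level version of the schema (the real schema in the question nests the structs inside the `choices` array; for an array column you would `to_json` the whole array and loop over the list inside the UDF). The field names `answer_type`, `response0`, and `response3` come from the question; the mask value `"###"` is an arbitrary choice.

    ```python
    import json

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    spark = SparkSession.builder.master("local[1]").appName("mask-demo").getOrCreate()

    # Simplified stand-in for the answers/response structs from the question.
    inner_schema = T.StructType([
        T.StructField("answers", T.StructType([
            T.StructField("label", T.StringType()),
            T.StructField("answer_type", T.StringType()),
        ])),
        T.StructField("response", T.StructType([
            T.StructField("response0", T.StringType()),
            T.StructField("response3", T.ArrayType(T.StringType())),
        ])),
    ])

    df = spark.createDataFrame(
        [((("lbl", "text"), ("secret", ["a", "b"])),)],
        T.StructType([T.StructField("choice", inner_schema)]),
    )

    def mask(js):
        # Edit the nested fields on the JSON representation of the struct.
        d = json.loads(js)
        d["answers"]["answer_type"] = "###"
        d["response"]["response0"] = "###"
        d["response"]["response3"] = ["###"] * len(d["response"]["response3"])
        return json.dumps(d)

    mask_udf = F.udf(mask, T.StringType())

    # Struct -> JSON string -> masked JSON string -> struct with the same schema.
    masked = df.withColumn(
        "choice", F.from_json(mask_udf(F.to_json("choice")), inner_schema)
    )
    masked.printSchema()
    ```

    Because `from_json` is given the original schema, the resulting column has the same structure as before; only the masked field values change. On newer Spark versions (3.1+), `Column.withField` can update nested struct fields without the JSON detour, though structs inside arrays still need a higher-order function such as `transform`.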