I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot but couldn't find the answer I need. This is the schema, and I need to change the fields answer_type, response0, and response3:
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in PySpark?
I've tried using explode, but I can't get back to the original schema. I don't want to create a new column, and at the same time I don't want to lose any data from the provided schema.
I ran into a similar problem a few days ago. I suggest converting the struct to JSON, making the internal changes inside a UDF, and then converting back so you get the original struct again.
You should look at to_json and from_json in the documentation; a minimal sketch follows the links.
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
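Here is a minimal sketch of that round trip, assuming the DataFrame is called df, the array column is named choices as in the printed schema, and "masking" simply means overwriting the values with a fixed placeholder (df and the "***" mask are illustrative, not from the question):

import json

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Keep the original array<struct<...>> type so from_json can rebuild the exact schema.
choices_type = df.select("choices").schema.fields[0].dataType

@F.udf(StringType())
def mask_choices(choices_json):
    # Receives the whole choices array serialized as a JSON string.
    if choices_json is None:
        return None
    choices = json.loads(choices_json)
    for choice in choices:
        answers = choice.get("answers")
        if answers is not None:
            answers["answer_type"] = "***"  # mask answer_type
        response = choice.get("response")
        if response is not None:
            response["response0"] = "***"   # mask response0
            if response.get("response3") is not None:
                # mask every element of the response3 array
                response["response3"] = ["***"] * len(response["response3"])
    return json.dumps(choices)

masked = df.withColumn(
    "choices",
    F.from_json(mask_choices(F.to_json("choices")), choices_type),
)
masked.printSchema()  # same structure as before, only the three fields are masked

Passing the original dataType back to from_json is what preserves the exact schema; the JSON round trip only touches the values.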