Suppose I have two columns (StringType):
JSON 1 | JSON 2 |
---|---|
{"key1":"value1"} | {"key2":"value2"} |
{"key3":"value3"} | {"key4":"value4"} |
I need to concat these columns to have a LIST of JSON in the third column (also string), something like this:
JSON 1 | JSON 2 | JSON 3 |
---|---|---|
{"key1":"value1"} | {"key2":"value2"} | [{"key1":"value1"},{"key2":"value2"}] |
{"key3":"value3"} | {"key4":"value4"} | [{"key3":"value3"},{"key4":"value4"}] |
Actually I solved the problem, but I don't think it's the best approach:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, concat_ws, lit
def concat_jsons(df: DataFrame, columns: list):
df = df.withColumn(
'JSON 3', concat_ws(',', *columns)
)
return df.withColumn(
'JSON 3',
concat(lit('['), col('JSON 3'), lit(']'))
)
Anyone has a better idea?
data =[["""{"key1":"value1"}""", """{"key2":"value2"}"""], ["""{"key3":"value3"}""","""{"key4":"value4"}"""] ]
df = spark.createDataFrame(data).toDF('JSON 1','JSON 2')
df.show()
+-----------------+-----------------+
| JSON 1| JSON 2|
+-----------------------------------+
|{"key1":"value1"}|{"key2":"value2"}|
|{"key3":"value3"}|{"key4":"value4"}|
+-----------------+-----------------+
def concat_jsons(df: DataFrame, columns: list):
... return df.withColumn('JSON 3',array(*columns))
concat_jsons(df, ['JSON 1','JSON 2']).show()
+-----------------+-----------------+--------------------+
| JSON 1| JSON 2| JSON 3|
+-----------------+-----------------+--------------------+
|{"key1":"value1"}|{"key2":"value2"}|[{"key1":"value1"...|
|{"key3":"value3"}|{"key4":"value4"}|[{"key3":"value3"...|
+-----------------+-----------------+--------------------+
if you want "JSON 3" to be for StringType, cast it to string
def concat_jsons(df: DataFrame, columns: list):
... return df.withColumn('JSON 3',array(*columns).cast('string'))