python, apache-spark, pyspark

Concatenating array columns when one is null


I have a dataframe with 3 columns:

  • column1 - int
  • column2 - array of struct
  • column3 - array of struct

I'm trying to concatenate column2 and column3. However, when either column contains null/None, the whole concatenated column becomes null/None, even though the other column has a value.

In the case below, the second record (id=2) returns null when country and reference are concatenated:

+---+--------------------+--------------------+
| id|   country          |    reference       |        
+---+--------------------+--------------------+
|  1|[{US, 2024-01-08}]  | [{UK, 2024-01-08}] |
|  2|[{US, 2024-01-08}]  | NULL / None        |
+---+--------------------+--------------------+
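
A minimal reproduction of this setup (a sketch: the struct fields key and timestamp are assumed from the error message further down, and string dates stand in for the real TIMESTAMP values to keep it short):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat

spark = SparkSession.builder.getOrCreate()

# hypothetical joined_df matching the shapes shown above
joined_df = spark.createDataFrame(
    [
        (1, [("US", "2024-01-08")], [("UK", "2024-01-08")]),
        (2, [("US", "2024-01-08")], None),
    ],
    "id INT, "
    "country ARRAY<STRUCT<key: STRING, timestamp: STRING>>, "
    "reference ARRAY<STRUCT<key: STRING, timestamp: STRING>>",
)

# concat propagates nulls: id=2 comes back NULL because reference is NULL
joined_df.select("id", concat("country", "reference").alias("segment")).show(truncate=False)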

So I tried to use coalesce as below, but I get an error: AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `` cannot be resolved.

from pyspark.sql.functions import concat, coalesce, col

result_df = joined_df.select("id", concat("country", coalesce(col("reference"), "")).alias("segment"))
# or equivalently:
result_df = joined_df.select("id", concat("country", coalesce("reference", "")).alias("segment"))

Then I used lit along with coalesce as below, and got the error: AnalysisException: [DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(reference, )" due to data type mismatch: Input to coalesce should all be the same type, but it's ("ARRAY<STRUCT<key: STRING, timestamp: TIMESTAMP>>" or "STRING").

from pyspark.sql.functions import concat, coalesce, lit

result_df = joined_df.select("id", concat("country",coalesce("reference",lit(""))).alias("segment"))

Desired Result:

+---+-------------------------------------+
| id|concatenated_column                  |
+---+-------------------------------------+
|  1|[{US, 2024-01-08}, {UK, 2024-01-08}] |
|  2|[{US, 2024-01-08}]                   |
+---+-------------------------------------+

Solution

  • Instead of lit("") try array(). All arguments to coalesce must share a type; reference is an ARRAY<STRUCT<...>>, so the fallback has to be an array as well, and an empty array() literal coerces to the element type of reference. A sketch applying this is below.
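
A minimal sketch of the fix against the joined_df from the question. Only reference is wrapped in coalesce, matching the case described; wrap country the same way if it can also be null:

from pyspark.sql.functions import concat, coalesce, array

# array() is an empty-array literal; its element type coerces to match
# reference, so coalesce no longer mixes an array with a string
result_df = joined_df.select(
    "id",
    concat("country", coalesce("reference", array())).alias("concatenated_column"),
)
result_df.show(truncate=False)

# Expected output (shapes as in the desired result above):
# +---+-------------------------------------+
# | id|concatenated_column                  |
# +---+-------------------------------------+
# |  1|[{US, 2024-01-08}, {UK, 2024-01-08}] |
# |  2|[{US, 2024-01-08}]                   |
# +---+-------------------------------------+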