I have a dataframe having 3 columns.
Im trying to concatenate column2 and column3. However when any one of the column contains null/none, then whole concatenated column becomes null/none, even when one of the other column has value.
In below case, second record ie. id=2 returns null when we concatenate country
and reference
+---+--------------------+--------------------+
| id| country | reference |
+---+--------------------+--------------------+
| 1|[{US, 2024-01-08}] | [{UK, 2024-01-08}] |
| 2|[{US, 2024-01-08}] | NULL / None |
+---+--------------------+--------------------+
So tried to use coalesce
as below. But getting an error : AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `` cannot be resolved.
from pyspark.sql.functions import concat, coalesce, lit
result_df = joined_df.select("id", concat("country", coalesce(col("reference"),"")).alias("segment")) OR
result_df = joined_df.select("id", concat("country", coalesce("reference","")).alias("segment"))
Used lit
along with coalesce
as below, getting error AnalysisException:[DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(reference, )" due to data type mismatch: Input to coalesce
should all be the same type, but it's ("ARRAY<STRUCT<key: STRING, timestamp: TIMESTAMP>>" or "STRING").;
from pyspark.sql.functions import concat, coalesce, lit
result_df = joined_df.select("id", concat("country",coalesce("reference",lit(""))).alias("segment"))
Desired Result:
+---+--------------------+------------------
| id| concatenated_column |
+---+--------------------+------------------
| 1|[{US, 2024-01-08}, {UK, 2024-01-08}] |
| 2|[{US, 2024-01-08}] |
+---+--------------------+------------------
Instead of lit("")
try array()