apache-spark, pyspark, aws-glue

Concat 1 to n items into a new Spark column


I am trying to build a dynamic concat of fields based on some configuration settings. The goal is to have a new field with the merged values of 1 to n fields.

from pyspark.sql import functions as F

language = "JP;EN"
language = language.split(";")  # -> ["JP", "EN"]
no_langs = len(language)

# check if the columns for each language exist
for lang in language:
    doc_lang = "doctor.name_" + lang
    if doc_lang not in case_df.columns:
        case_df_final = AddColumn(case_df, doc_lang)

# combine translations of master data
case_df = case_df.withColumn(
    "doctor",
    F.concat(
        F.col("doctor.name_" + language[0]),
        F.lit(" // "),
        F.col("doctor.name_" + language[1]),
    ),
)

What I would like to achieve is that the new column is built dynamically, depending on the number of languages configured. E.g. if only one language is used, the result would look like this:

case_df = case_df.withColumn(
    "doctor",
    F.col("doctor.name_" + language[0]),
)

For two or more languages it should pick all the languages in the order of the list (a three-language example is sketched below). Thanks for your help. I am using Spark 2.4 with Python 3.
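
For illustration, with a hypothetical three-language configuration ("JP;EN;DE") the hand-written version would have to interleave the separator between every pair of columns, which is exactly what does not scale:

# hypothetical expansion for language = ["JP", "EN", "DE"]
case_df = case_df.withColumn(
    "doctor",
    F.concat(
        F.col("doctor.name_" + language[0]),
        F.lit(" // "),
        F.col("doctor.name_" + language[1]),
        F.lit(" // "),
        F.col("doctor.name_" + language[2]),
    ),
)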

The expected output would be a single doctor column with the per-language values joined by " // " in the configured order.



Solution

  • The final working code is the following:

    from pyspark.sql import functions as F

    # add any missing per-language columns, then rename the dotted
    # names to underscore form so they are plain top-level columns
    for lang in language:
        doc_lang = "doctor.name_" + lang
        if doc_lang not in case_df.columns:
            case_df = AddColumn(case_df, doc_lang)
        doc_lang_new = doc_lang.replace(".", "_")
        case_df = case_df.withColumnRenamed(doc_lang, doc_lang_new)

    # concat_ws joins all configured language columns with " // "
    # and skips nulls, so it works for any number of languages
    doc_fields = ["doctor_name_" + lang for lang in language]
    case_df = case_df.withColumn("doctor", F.concat_ws(" // ", *doc_fields))
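
    AddColumn is the poster's own helper and is not shown in the thread; a minimal stand-in (an assumption, not the original implementation) could simply create the missing column filled with string nulls:

    def AddColumn(df, col_name):
        # hypothetical stand-in for the poster's helper: add the
        # missing column as string nulls so concat_ws can skip it
        return df.withColumn(col_name, F.lit(None).cast("string"))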
    
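    A quick sanity check on a toy DataFrame (data made up for illustration) confirms that concat_ws joins the columns in list order and skips nulls instead of nulling out the whole result:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # toy data: the second row has no EN translation
    df = spark.createDataFrame(
        [("Tanaka", "Smith"), ("Sato", None)],
        ["doctor_name_JP", "doctor_name_EN"],
    )

    doc_fields = ["doctor_name_" + lang for lang in ["JP", "EN"]]
    df = df.withColumn("doctor", F.concat_ws(" // ", *doc_fields))
    df.show()
    # doctor column: "Tanaka // Smith" and "Sato" (the null is skipped)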

    Thanks all for the help and hints.