How can I capitalize each word delimited by some characters in Pyspark?


I'm currently working with Pyspark and I'm facing a seemingly simple problem.

I want to capitalize the first letter of each word, even if the words are separated by characters in the following list:

delimiter_list = [' ', '(', '+', '/', '-']

However, initcap only treats whitespace as a word delimiter, so it leaves words that follow the other characters in lowercase.

Is there an efficient solution? Here are some input-output examples:

input                                output
baden-baden                          Baden-Baden
markranstadt/brandenburg-kirchmöser  Markranstadt/Brandenburg-Kirchmöser
ostrow mazowiecki/bialystok          Ostrow Mazowiecki/Bialystok

Solution

  • Since the delimiters differ, first insert a common marker, say #, after each delimiter from delimiter_list (that is, in front of the character that follows it) using regexp_replace:

    regexp_replace(words, '(\\s|\\(|\\+|-|\\/)(.)', '$1#$2')
    

    Now split on # and capitalize each element of the resulting array with the transform function. Finally, join the elements back into a single string with array_join. This assumes # never appears in the data; if it can, pick a marker that does not:

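    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample data taken from the question
    df = spark.createDataFrame(
        [
            ("baden-baden",),
            ("markranstadt/brandenburg-kirchmöser",),
            ("ostrow mazowiecki/bialystok",),
        ],
        ["words"],
    )

    # Mark each word start with '#', split on it, capitalize each piece, re-join
    df1 = df.withColumn(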
        "words_capitalized",
        F.expr(r"""
            array_join(
                transform(
                    split(regexp_replace(words, '(\\s|\\(|\\+|-|\\/)(.)', '$1#$2'), '#'),
                    x -> initcap(x)
                ),
                ""
            )
        """)
    )
    
    df1.show(truncate=False)
    
    #+-----------------------------------+-----------------------------------+
    #|words                              |words_capitalized                  |
    #+-----------------------------------+-----------------------------------+
    #|baden-baden                        |Baden-Baden                        |
    #|markranstadt/brandenburg-kirchmöser|Markranstadt/Brandenburg-Kirchmöser|
    #|ostrow mazowiecki/bialystok        |Ostrow Mazowiecki/Bialystok        |
    #+-----------------------------------+-----------------------------------+
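
To check the pattern without starting a Spark session, the same mark, split, capitalize, join steps can be sketched in plain Python. This is only an approximation: `initcap_like` and `capitalize_words` are hypothetical helpers, and `initcap_like` is a rough stand-in for Spark's initcap that treats only the space character as a word separator.

```python
import re

# Same delimiter pattern as in the Spark expression, written for Python's `re`.
PATTERN = re.compile(r"(\s|\(|\+|-|/)(.)")
MARKER = "#"  # assumed never to occur in the data


def initcap_like(text: str) -> str:
    """Rough stand-in for Spark's initcap (assumption: only spaces separate words)."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))


def capitalize_words(text: str) -> str:
    # 1. Insert the marker between each delimiter and the character after it.
    marked = PATTERN.sub(lambda m: m.group(1) + MARKER + m.group(2), text)
    # 2. Split on the marker, 3. capitalize each piece, 4. join the pieces back.
    return "".join(initcap_like(piece) for piece in marked.split(MARKER))


for s in ["baden-baden", "ostrow mazowiecki/bialystok", "a (b"]:
    print(s, "->", capitalize_words(s))
# baden-baden -> Baden-Baden
# ostrow mazowiecki/bialystok -> Ostrow Mazowiecki/Bialystok
# a (b -> A (b
```

The last input shows a limit of the pattern: when two delimiters are adjacent (here a space followed by an opening parenthesis), the second one is consumed as the "next character", so the word after it stays lowercase. Java regex matches are non-overlapping as well, so the Spark expression presumably behaves the same way. One possible workaround is to insert the marker after every delimiter regardless of what follows, for example with the pattern `([\s(+/-])` and the replacement `$1#`.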