I'm currently working with PySpark and I'm facing a seemingly simple problem.
I want to capitalize the first letter of each word, even when the words are separated by any of the characters in the following list:
delimiter_list = [' ', '(', '+', '/', '-']
Unfortunately, initcap only treats whitespace as a word delimiter.
Is there an efficient solution? Here are some input/output examples:
| input | output |
|---|---|
| baden-baden | Baden-Baden |
| markranstadt/brandenburg-kirchmöser | Markranstadt/Brandenburg-Kirchmöser |
| ostrow mazowiecki/bialystok | Ostrow Mazowiecki/Bialystok |
Since the delimiter isn't always the same, you can first insert a common marker, say `#`, after each occurrence of a character from your `delimiter_list`, using `regexp_replace`:

```
regexp_replace(words, '(\\s|\\(|\\+|-|\\/)(.)', '$1#$2')
```
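As a quick sanity check, the same substitution can be reproduced in plain Python with the `re` module (note that Python uses `\1`-style backreferences where Spark SQL uses `$1`; `add_markers` is just an illustrative name):

```python
import re

def add_markers(s):
    # Insert a '#' marker after every delimiter character: space, '(', '+', '/', '-'.
    return re.sub(r'([\s(+/\-])(.)', r'\1#\2', s)

print(add_markers('baden-baden'))  # baden-#baden
```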
Now you can split by `#`, capitalize each element of the resulting array with the `transform` function, and finally rebuild the string with the `array_join` function:
```python
from pyspark.sql import functions as F

df1 = df.withColumn(
    "words_capitalized",
    F.expr(r"""
        array_join(
            transform(
                split(regexp_replace(words, '(\\s|\\(|\\+|-|\\/)(.)', '$1#$2'), '#'),
                x -> initcap(x)
            ),
            ""
        )
    """)
)

df1.show(truncate=False)

#+-----------------------------------+-----------------------------------+
#|words                              |words_capitalized                  |
#+-----------------------------------+-----------------------------------+
#|baden-baden                        |Baden-Baden                        |
#|markranstadt/brandenburg-kirchmöser|Markranstadt/Brandenburg-Kirchmöser|
#|ostrow mazowiecki/bialystok        |Ostrow Mazowiecki/Bialystok        |
#+-----------------------------------+-----------------------------------+
```
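If you want to verify the logic without spinning up a Spark session, here is a minimal plain-Python sketch of the same pipeline (marker insertion, split, per-piece capitalization, join). It assumes the input contains no literal `#`; `str.capitalize` plays the role of `initcap` on each piece, and `initcap_multi` is an illustrative name, not a Spark function:

```python
import re

def initcap_multi(s):
    """Mirror the Spark expression: mark delimiters, split, capitalize, re-join."""
    # Step 1: insert a '#' marker after each delimiter character.
    marked = re.sub(r'([\s(+/\-])(.)', r'\1#\2', s)
    # Step 2: split on the marker, capitalize each piece, and glue them back together.
    return ''.join(piece.capitalize() for piece in marked.split('#'))

print(initcap_multi('markranstadt/brandenburg-kirchmöser'))
# Markranstadt/Brandenburg-Kirchmöser
```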