Tags: python, dataframe, pyspark, capitalize

How to apply "initcap" only on records whose values are not all capital letters in a PySpark DataFrame?


I have a PySpark DataFrame and I want to apply "initcap" to a specific column. However, I want this transformation only on records whose value is not all capitals. For example, in the sample dataset below, I don't want to apply "initcap" to USA:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prepare Data
data = [(1, "Italy"),
        (2, "italy"),
        (3, "USA"),
        (4, "China"),
        (5, "china")]

# Create DataFrame
columns = ["ID", "Country"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

+---+-------+
|ID |Country|
+---+-------+
|1  |Italy  |
|2  |italy  |
|3  |USA    |
|4  |China  |
|5  |china  |
+---+-------+

The expected output will be:

ID  Country
1   Italy
2   Italy
3   USA
4   China
5   China

Solution

  • Compare the column to its uppercased version: if they are equal, the value is already all capitals and is kept as-is; otherwise apply initcap:

    from pyspark.sql.functions import when, upper, initcap

    df = df.withColumn(
        "Country",
        when(df.Country == upper(df.Country), df.Country)
        .otherwise(initcap(df.Country))
    )
    df.show(truncate=False)
    
    +---+-------+
    |ID |Country|
    +---+-------+
    |1  |Italy  |
    |2  |Italy  |
    |3  |USA    |
    |4  |China  |
    |5  |China  |
    +---+-------+
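To see the logic independent of Spark, the same condition can be sketched in plain Python (a minimal illustration, not the Spark implementation; the helper name `conditional_initcap` is made up here):

```python
def conditional_initcap(value: str) -> str:
    # Leave all-caps values (e.g. acronyms like "USA") untouched;
    # otherwise capitalize the first letter of each word,
    # which is roughly what Spark's initcap does.
    if value == value.upper():
        return value
    return " ".join(word.capitalize() for word in value.split(" "))

rows = ["Italy", "italy", "USA", "China", "china"]
print([conditional_initcap(v) for v in rows])
# → ['Italy', 'Italy', 'USA', 'China', 'China']
```

Note that Spark's `initcap` also lowercases the rest of each word, which is why the all-caps check is needed: without it, `initcap` would turn "USA" into "Usa".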