
How do you capitalize just the first letter in PySpark for a dataset? (Simple capitalization/sentence case)


I need to clean several fields: species/description usually use simple capitalization, in which only the first letter is capitalized. PySpark only has upper, lower, and initcap (which capitalizes every single word), none of which is what I'm looking for. https://spark.apache.org/docs/2.0.1/api/python/_modules/pyspark/sql/functions.html

Python has a native capitalize() method, which I have been trying to use, but I keep getting an error about an incorrect call to a Column.

fields_to_cap = ['species', 'description']

for col_name in fields_to_cap:
    df = df.withColumn(col_name, df[col_name].capitalize())

Is there a way to easily capitalize these fields?

To be clear, I am trying to capitalize the data within the fields. Here is an example:

Current: "tHis is a descripTion."

Expected: "This is a description."


Solution

  • As a workaround, you can split each string into its first letter and the rest, uppercase the first letter, lowercase the rest, and then concatenate the two parts back together:

    import pyspark.sql.functions as F
    
    df = spark.createDataFrame([[1, 'rush HouR'],
                                [2, 'kung-Fu Panda'],
                                [3, 'titaniC'],
                                [4, 'the Sixth sense']], schema="id int, title string")
    
    df = df.withColumn('title_capitalize', F.concat(F.upper(F.expr("substring(title, 1, 1)")), 
                                                    F.lower(F.expr("substring(title, 2)"))))
    df.show()
    
    +---+---------------+----------------+
    | id|          title|title_capitalize|
    +---+---------------+----------------+
    |  1|      rush HouR|       Rush hour|
    |  2|  kung-Fu Panda|   Kung-fu panda|
    |  3|        titaniC|         Titanic|
    |  4|the Sixth sense| The sixth sense|
    +---+---------------+----------------+
    

    Alternatively, you can use a UDF if you want to keep using Python's .capitalize():

    from pyspark.sql.types import StringType
    
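    # Pass nulls through unchanged; str(None).capitalize() would yield the string 'None'
    udf_capitalize  = F.udf(lambda x: x.capitalize() if x is not None else None, StringType())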
    
    df = df.withColumn('title_capitalize', udf_capitalize('title'))
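
One thing to watch with the UDF route is null handling: Spark hands a null to a Python UDF as None, and calling str() on it produces the literal string 'None' rather than a null. A minimal plain-Python sketch of a null-safe helper follows; the name capitalize_or_none is hypothetical and is used here only for illustration.

```python
from typing import Optional


def capitalize_or_none(value: Optional[str]) -> Optional[str]:
    """Sentence-case a string, passing None (a Spark null) through unchanged."""
    return value.capitalize() if value is not None else None


print(capitalize_or_none("tHis is a descripTion."))  # This is a description.
print(capitalize_or_none(None))                      # None (the value, not a string)
print(repr(str(None).capitalize()))                  # 'None' (a real four-character string)
```

A function like this could be wrapped with F.udf(capitalize_or_none, StringType()) in place of the lambda. Where performance matters, the built-in column functions are usually the better choice, since a Python UDF has to move each row between the JVM and a Python worker.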