
PySpark: Why does using F.expr work but using the PySpark API does not


I have this line of code:

df = df.withColumn("final_name", F.substring(F.col("name"), 1, F.length(F.col("name"))-15))

When I run it, I get the error Column is not iterable (something with length seems to be causing the issue). However, when I use the equivalent code with F.expr(), it works. Why is that?

df = df.withColumn("final_name", F.expr("substring(name, 1, length(name)-15)"))

This is really more for my own education on why my original code doesn't work. Thanks for your help.


Solution

  • substring(str: ColumnOrName, pos: int, len: int) only accepts plain Python ints for pos and len. Passing a Column, such as F.length(F.col("name")) - 15, is what raises Column is not iterable. F.expr() works because the whole string is parsed as a SQL expression, where substring's arguments can themselves be arbitrary expressions like length(name)-15.

    Use substr(str: ColumnOrName, pos: ColumnOrName, len: Optional[ColumnOrName] = None) (available since Spark 3.5) if you want pos or len to be computed per row. Note that all of its arguments must be Columns, so wrap the literal 1 in F.lit():

    df = df.withColumn("final_name", F.substr(F.col("name"), F.lit(1), F.length(F.col("name")) - 15))
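
    For contrast, the original F.substring call is fine as long as pos and len are hardcoded ints (name_prefix below is just an illustrative column name):

    df = df.withColumn("name_prefix", F.substring(F.col("name"), 1, 5))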
    

    substr is also available as a method on Column (pyspark.sql.Column.substr), where it accepts either plain ints or Columns.

    So these two lines are equivalent:

    df.withColumn("final_name", df.name.substr(F.lit(1), F.length(df.name)-15))
    df.withColumn("final_name", F.col("name").substr(F.lit(1), F.length(F.col("name"))-15))
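
    If you want to verify the equivalence end to end, here is a minimal, self-contained sketch (it assumes Spark 3.5+ so that F.substr exists; the sample names are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample rows whose "name" values are longer than 15 characters.
    df = spark.createDataFrame(
        [("project_alpha_2024_internal",), ("quarterly_report_final_backup",)],
        ["name"],
    )

    # 1. SQL string: parsed as SQL, so length(name)-15 is a valid argument.
    expr_df = df.withColumn("final_name", F.expr("substring(name, 1, length(name)-15)"))

    # 2. F.substr (Spark 3.5+): every argument must be a Column, hence F.lit(1).
    substr_df = df.withColumn("final_name", F.substr(F.col("name"), F.lit(1), F.length(F.col("name")) - 15))

    # 3. Column.substr: takes start and length as ints or Columns.
    col_df = df.withColumn("final_name", F.col("name").substr(F.lit(1), F.length(F.col("name")) - 15))

    # All three show identical results.
    expr_df.show(truncate=False)
    substr_df.show(truncate=False)
    col_df.show(truncate=False)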