Search code examples
pythonpysparksubstring

remove last few characters in PySpark dataframe column


I am having a PySpark DataFrame. How can I chop off/remove last 5 characters from the column name below -

from pyspark.sql.functions import substring, length
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()

+--------------+
|          name|
+--------------+
|     rose_2012|
|  jasmine_2013|
|     lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+

I want to create 2 columns, the flower and year column.

Expected output:

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|subflower|
+--------------+----+---------+

year column I have created -

df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
|          name|year|
+--------------+----+
|     rose_2012|2012|
|  jasmine_2013|2013|
|     lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+

I don't know how to chop last 5 characters, so that I only have the name of flowers. I tried something like this, by invoking length, but that doesn't work.

df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))

How can I create flower column with only flower names?


Solution

  • You can use expr function

    >>> from pyspark.sql.functions import substring, length, col, expr
    >>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
    >>> df.show()
    +--------------+----+---------+
    |          name|year|   flower|
    +--------------+----+---------+
    |     rose_2012|2012|     rose|
    |  jasmine_2013|2013|  jasmine|
    |     lily_2014|2014|     lily|
    | daffodil_2017|2017| daffodil|
    |sunflower_2016|2016|sunflower|
    +--------------+----+---------+