
How to replace any instances of an integer with NULL in a column meant for strings using PySpark?


Notice: this is for Spark version 2.1.1.2.6.1.0-129

I have a Spark DataFrame. One of the columns contains US states as strings (e.g. Illinois, California, Nevada), but it also contains some number-like values (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with a NULL.

The following is some code that I have written:

my_df = my_df.selectExpr(
        " regexp_replace(states, '^-?[0-9]+$', '') AS states ",
        "someOtherColumn")

This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.


Solution

  • I strongly suggest you look at the PySpark SQL functions and use them properly instead of selectExpr

    from pyspark.sql import functions as F
    
    (df
        .withColumn('states_fixed', F
            .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
            .otherwise(F.col('states'))
        )
        .show()
    )
    
    # Output
    # +----------+------------+
    # |    states|states_fixed|
    # +----------+------------+
    # |  Illinois|    Illinois|
    # |        12|        null|
    # |California|  California|
    # |        01|        null|
    # |    Nevada|      Nevada|
    # +----------+------------+