
How to replace any instances of an integer with NULL in a column meant for strings using PySpark?


Notice: this is for Spark version 2.1.1.2.6.1.0-129

I have a Spark DataFrame. One of the columns contains US states as strings (e.g. Illinois, California, Nevada), but it also contains some number-like values (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with a NULL.

The following is some code that I have written:

my_df = my_df.selectExpr(
        " regexp_replace(states, '^-?[0-9]+$', '') AS states ",
        "someOtherColumn")

This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.


Solution

  • I strongly suggest you look at the PySpark SQL functions and use them properly instead of selectExpr

    from pyspark.sql import functions as F
    
    (df
        .withColumn('states_fixed', F
            .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
            .otherwise(F.col('states'))
        )
        .show()
    )
    
    # Output
    # +----------+------------+
    # |    states|states_fixed|
    # +----------+------------+
    # |  Illinois|    Illinois|
    # |        12|        null|
    # |California|  California|
    # |        01|        null|
    # |    Nevada|      Nevada|
    # +----------+------------+