Search code examples
pysparksparkr

Filtering and reassigning based on regex in Spark


enter image description here

I have a Spark DataFrame named df, as shown in the picture above. There is a column named region that is currently has only the value NE. There is a separate column named address with many addresses. I want to change the observations of region to VA where address ends with VA. How can I do this? Either pyspark or sparkr commands would work.


Solution

  • the below snippet should work. This method converts the Dataframe to RDD and performs a map operation to change region appropriately.

    >>> df = sc.parallelize([("NE","NE"), ("Luray, VA", "NE"), ("VA" ,"NE"), ("Richmond, VA",  "NE")]).toDF(["address", "region"])
    >>> df.rdd.map(lambda (x,y): (x,'VA' if x.endswith('VA') else y)).toDF(["address", "region"]).show()
    +------------+------+
    |     address|region|
    +------------+------+
    |          NE|    NE|
    |   Luray, VA|    VA|
    |          VA|    VA|
    |Richmond, VA|    VA|
    +------------+------+
    

    the approach without having convert dataframe to rdd would look like the below. Note: this approach is preferable over the rdd approach since it is more performant and makes less assumptions about the schema.

    from pyspark.sql.functions import udf
    df = sc.parallelize([("NE","NE"), ("Luray, VA", "NE"), ("VA" ,"NE"), ("Richmond, VA",  "NE")]).toDF(["address", "region"])
    regionfunc = udf(lambda x: 'VA' if x.endswith('VA') else x)
    df.withColumn('region', regionfunc(df.address)).show()
    
    +------------+------+
    |     address|region|
    +------------+------+
    |          NE|    NE|
    |   Luray, VA|    VA|
    |          VA|    VA|
    |Richmond, VA|    VA|
    +------------+------+