Filtering and reassigning based on regex in Spark

I have a Spark DataFrame named df, as shown in the picture above. There is a column named region that is currently has only the value NE. There is a separate column named address with many addresses. I want to change the observations of region to VA where address ends with VA. How can I do this? Either pyspark or sparkr commands would work.

Solution

the below snippet should work. This method converts the Dataframe to RDD and performs a map operation to change region appropriately.

>>> df = sc.parallelize([("NE","NE"), ("Luray, VA", "NE"), ("VA" ,"NE"), ("Richmond, VA",  "NE")]).toDF(["address", "region"])
>>> df.rdd.map(lambda (x,y): (x,'VA' if x.endswith('VA') else y)).toDF(["address", "region"]).show()
+------------+------+
|     address|region|
+------------+------+
|          NE|    NE|
|   Luray, VA|    VA|
|          VA|    VA|
|Richmond, VA|    VA|
+------------+------+

the approach without having convert dataframe to rdd would look like the below. Note: this approach is preferable over the rdd approach since it is more performant and makes less assumptions about the schema.

from pyspark.sql.functions import udf
df = sc.parallelize([("NE","NE"), ("Luray, VA", "NE"), ("VA" ,"NE"), ("Richmond, VA",  "NE")]).toDF(["address", "region"])
regionfunc = udf(lambda x: 'VA' if x.endswith('VA') else x)
df.withColumn('region', regionfunc(df.address)).show()

+------------+------+
|     address|region|
+------------+------+
|          NE|    NE|
|   Luray, VA|    VA|
|          VA|    VA|
|Richmond, VA|    VA|
+------------+------+