I have a Spark DataFrame named df, as shown in the picture above. There is a column named region that currently has only the value NE. There is a separate column named address with many addresses. I want to change the observations of region to VA where address ends with VA. How can I do this? Either pyspark or sparkr commands would work.
The snippet below should work. This method converts the DataFrame to an RDD and performs a map over each row to change region appropriately. (Rows are accessed by position, since the RDD map loses the column names; toDF restores them afterwards.)
>>> df = sc.parallelize([("NE", "NE"), ("Luray, VA", "NE"), ("VA", "NE"), ("Richmond, VA", "NE")]).toDF(["address", "region"])
>>> df.rdd.map(lambda r: (r[0], 'VA' if r[0].endswith('VA') else r[1])).toDF(["address", "region"]).show()
+------------+------+
| address|region|
+------------+------+
| NE| NE|
| Luray, VA| VA|
| VA| VA|
|Richmond, VA| VA|
+------------+------+
The approach without converting the DataFrame to an RDD looks like the below. Note: this approach is preferable to the RDD round-trip, since it stays in the DataFrame API and makes fewer assumptions about the schema (columns are referenced by name rather than by position).
from pyspark.sql.functions import udf

df = sc.parallelize([("NE", "NE"), ("Luray, VA", "NE"), ("VA", "NE"), ("Richmond, VA", "NE")]).toDF(["address", "region"])

# Return 'VA' when the address ends with 'VA'; otherwise keep the existing region.
# (Passing both columns matters: a one-argument UDF that falls back to the address
# itself would write arbitrary addresses into region for rows not ending in 'VA'.)
regionfunc = udf(lambda addr, reg: 'VA' if addr.endswith('VA') else reg)
df.withColumn('region', regionfunc(df.address, df.region)).show()
+------------+------+
| address|region|
+------------+------+
| NE| NE|
| Luray, VA| VA|
| VA| VA|
|Richmond, VA| VA|
+------------+------+
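
For completeness, the same update can be expressed with Spark's built-in column functions instead of a Python UDF. This keeps the whole expression inside the JVM, so Catalyst can optimize it and no Python serialization is involved. A minimal sketch, assuming the same df as above, using when/otherwise and Column.endswith from pyspark.sql.functions:

from pyspark.sql.functions import when, col

# Set region to 'VA' where the address ends with 'VA'; otherwise keep the current value
df.withColumn('region', when(col('address').endswith('VA'), 'VA').otherwise(col('region'))).show()

This produces the same table as the UDF version.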