I have a dataframe whose rows look like this:

    Row(id='123456', name='Computer Science', class='Science')

and the dataframe has about 1000 rows.
Now I have a parse function for each column, like:

    def parse_id(id):
        id = somestuff
        return new_id

and similarly parse_name, parse_class, and so on.
I want to apply each of those functions to the corresponding column of every dataframe row, producing new columns new_id, new_name, new_class. So the resultant rows will look like:

    Row(id='123456', name='Computer Science', class='Science', new_id='12345668688', new_name='Computer Science new', new_class='Science new')
How can I do that?
I'd suggest reading up on the concept of UDFs (user-defined functions) in Spark; for example, this blog post https://changhsinlee.com/pyspark-udf/ describes the concept quite well, with enough examples.
For your problem, assuming your input dataframe is in the variable df, this code should solve it:
    import pyspark.sql.functions as f
    import pyspark.sql.types as t

    # Wrap each plain Python parse function as a Spark UDF returning a string
    parse_id_udf = f.udf(parse_id, t.StringType())
    parse_name_udf = f.udf(parse_name, t.StringType())
    parse_class_udf = f.udf(parse_class, t.StringType())

    result_df = df.select(f.col("id"), f.col("name"), f.col("class"),
                          parse_id_udf(f.col("id")).alias("new_id"),
                          parse_name_udf(f.col("name")).alias("new_name"),
                          parse_class_udf(f.col("class")).alias("new_class"))