python dataframe pyspark apache-spark-sql amazon-emr

How can I add multiple columns to an existing dataframe in PySpark on AWS EMR?


I have a dataframe whose rows look like this:

Row(id='123456', name='Computer Science', class='Science')

The dataframe has about 1000 rows.

I also have a function like this:

def parse_id(id):
    new_id = somestuff  # placeholder for whatever transformation is applied to id
    return new_id

For every column I have a corresponding parse function, such as parse_name and parse_class.

I want to apply those functions to each row of the dataframe so that each one produces a new column: new_id, new_name and new_class.

So the resulting dataframe should look like this:

Row(id='123456', name='Computer Science', class='Science', new_id='12345668688', new_name='Computer Science new', new_class='Science new')

How can I do that?


Solution

  • I'd suggest reading up on UDFs (user-defined functions) in Spark. For example, this blog post, https://changhsinlee.com/pyspark-udf/, describes the concept well and includes plenty of examples.

    As for your problem, assuming your input dataframe is in the variable df and that parse_id, parse_name and parse_class are defined as in the question, this code should solve it:

    import pyspark.sql.functions as f
    import pyspark.sql.types as t
    
    # Wrap each plain Python function as a Spark UDF that returns a string.
    parse_id_udf = f.udf(parse_id, t.StringType())
    parse_name_udf = f.udf(parse_name, t.StringType())
    parse_class_udf = f.udf(parse_class, t.StringType())
    
    # Keep the original columns and append one parsed column per UDF.
    result_df = df.select(f.col("id"), f.col("name"), f.col("class"),
                          parse_id_udf(f.col("id")).alias("new_id"),
                          parse_name_udf(f.col("name")).alias("new_name"),
                          parse_class_udf(f.col("class")).alias("new_class"))
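
If you have many columns, the same idea can be written as a loop over a column-to-function mapping instead of one line per column. The following is only a sketch: the bodies of parse_id, parse_name and parse_class are hypothetical stand-ins (the question does not show the real logic), chosen so that they reproduce the sample output row, and PARSERS and add_parsed_columns are names made up for this example.

```python
# Hypothetical stand-ins for the question's parse functions.
# The real logic is not shown in the post; these only mimic its sample output.
def parse_id(value):
    return value + "68688"


def parse_name(value):
    return value + " new"


def parse_class(value):
    return value + " new"


# Map each source column to the function that parses it (illustrative name).
PARSERS = {"id": parse_id, "name": parse_name, "class": parse_class}


def add_parsed_columns(df, parsers=PARSERS):
    """Return df with one extra new_<column> column per entry in parsers."""
    # Imported here so the plain Python parsers above can be used
    # and tested without Spark installed.
    from pyspark.sql import functions as f
    from pyspark.sql import types as t

    new_columns = [
        # Wrap the function as a string-returning UDF and apply it to its column.
        f.udf(func, t.StringType())(f.col(name)).alias("new_" + name)
        for name, func in parsers.items()
    ]
    # "*" keeps every existing column; the new ones are appended after them.
    return df.select("*", *new_columns)
```

With the dataframe from the question in df, result_df = add_parsed_columns(df) should give the same six columns as the explicit select above. Note that these stand-in functions do not handle null values, so real parse functions would need to check for None if the columns can contain nulls.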