
Creating/registering a PySpark UDF and applying it to one column

I am a little confused about how to create a Spark UDF. Right now I have a function parse_xml and do the following:

spark.udf.register("parse_xml_udf", parse_xml)
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

where xml_df is the original Spark DataFrame and raw_xml is the column I want to apply the function to.

I have seen in a few places a line like spark_udf = udf(parse_xml, StringType()). What is the difference between this and the spark.udf.register line? Additionally, if I apply the function to that one column, is it applied to each row? In other words, should my UDF return the output for one single row?


    • Use spark.udf.register("squaredWithPython", squared) if you want to call the function from SQL, like this: %sql select id, squaredWithPython(id) as id_squared from test

    • Use squared_udf = udf(squared, LongType()) if you only need it in DataFrame code, like this: display(df.select("id", squared_udf("id").alias("id_squared")))
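To make the two options concrete, here is a minimal sketch that puts them side by side. The helper name demo, the spark.range test data, and the temp view name test are assumptions for illustration; only squared, squaredWithPython, and squared_udf come from the snippets above. pyspark is imported inside the helper so the plain function can be tried without a Spark session.

```python
def squared(x):
    # Plain Python function: takes one value and returns one value.
    return x * x


def demo(spark):
    # Hypothetical helper; `spark` is assumed to be an existing SparkSession.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    # Assumed test data: a single `id` column, exposed to SQL as the view `test`.
    df = spark.range(1, 4)
    df.createOrReplaceTempView("test")

    # Option 1: register under a name so SQL queries can call it.
    spark.udf.register("squaredWithPython", squared, LongType())
    sql_result = spark.sql(
        "SELECT id, squaredWithPython(id) AS id_squared FROM test"
    )

    # Option 2: wrap the function for use in DataFrame expressions.
    squared_udf = udf(squared, LongType())
    df_result = df.select("id", squared_udf("id").alias("id_squared"))

    return sql_result, df_result
```

Note that if no return type is passed, both udf and spark.udf.register default to StringType, so it is usually worth stating the type explicitly.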

    That's all, but these things are not always clearly explained in the manuals.
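On the per-row part of the question: a regular Python UDF is called once per row with that row's column value, so the function should take one value and return one value. The sketch below assumes a stand-in parse_xml that just returns the root tag name (the real parsing logic is not shown in the question), and add_parsed_column is a hypothetical helper name. It also relies on spark.udf.register returning the wrapped UDF, which has to be bound to a Python name before it can be used in withColumn.

```python
import xml.etree.ElementTree as ET


def parse_xml(raw_xml):
    # Stand-in parser (assumption): receives ONE row's raw_xml string
    # and returns ONE string, here the tag name of the root element.
    if raw_xml is None:
        return None
    try:
        return ET.fromstring(raw_xml).tag
    except ET.ParseError:
        # Malformed XML becomes a null in the output column.
        return None


def add_parsed_column(spark, xml_df):
    # Hypothetical helper; `spark` is a SparkSession and `xml_df` a
    # DataFrame with a string column named raw_xml.
    from pyspark.sql.types import StringType

    # register() makes the name available in SQL and also returns the UDF,
    # so the result can be reused directly in DataFrame code.
    parse_xml_udf = spark.udf.register("parse_xml_udf", parse_xml, StringType())
    return xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))
```

Because parse_xml is an ordinary function, it can be checked on a single string before it is handed to Spark.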