I am a little confused about how to create a Spark UDF. Right now I have a function parse_xml and do the following:

spark.udf.register("parse_xml_udf", parse_xml)
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

where xml_df is the original Spark DataFrame and raw_xml is the column I want to apply the function to.
In a few places I have seen a line like spark_udf = udf(parse_xml, StringType()) -- what is the difference between this and the spark.udf.register line? Additionally, if I apply the function to that one column, is it applied to each row? In other words, should my UDF return the output for a single row?
Use spark.udf.register("squaredWithPython", squared) if you want to call the function from SQL, like this:

%sql select id, squaredWithPython(id) as id_squared from test
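Here is a minimal, self-contained sketch of that SQL path, assuming a simple squared function and a temp view named test with an id column (both just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

def squared(x):
    return x * x

# Registering by name makes the function callable from SQL.
spark.udf.register("squaredWithPython", squared, LongType())

spark.range(5).createOrReplaceTempView("test")
spark.sql("SELECT id, squaredWithPython(id) AS id_squared FROM test").show()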
Use squared_udf = udf(squared, LongType()) if you only need it in the DataFrame API, like this:

display(df.select("id", squared_udf("id").alias("id_squared")))
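And a minimal sketch of the DataFrame-only path, reusing the same hypothetical squared function; udf() returns a wrapper that you call like any other column expression:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

def squared(x):
    return x * x

# udf() wraps the Python function; the wrapper is applied to a column,
# and the function itself receives one value per row and returns one value per row.
squared_udf = udf(squared, LongType())

df = spark.range(5)
df.select("id", squared_udf("id").alias("id_squared")).show()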
That's all there is to it, but it is not always clearly explained in the documentation.