scala apache-spark apache-spark-sql gpu rapids

Scala RAPIDS: using an opaque UDF on a single-column dataframe to produce another column

I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The main obstacle to using the GPU appears to be the black-box nature of UDFs. An automatic solution would be the Scala UDF compiler, but it won't work in cases where there are loops.

Question: Would I be able to get any GPU contribution if my dataframe has only one column and the UDF produces another column, since this is the most trivial case? If so, then at least in some cases the GPU performance benefit could be attained with no change to the Spark code, even when the data is much larger than GPU memory. This would be great, because it is sometimes easy to merge all columns into one, making a single column of WrappedArray using concat_ws that a UDF can simply convert into an Array. For all practical purposes the data would then already be in columnar form as far as the GPU is concerned, and only the negligible overhead of converting rows (on CPU) to columns (on GPU) would remain. The case I am referring to would look like:

val newDf = df.withColumn("colB", opaqueUdf(col("colA")))

Resources: I tried to find good sources/examples for learning the Spark-based approach to using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource or tutorial with examples of converting Spark UDFs to make them RAPIDS compatible?

Solution

Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there isn't a direct expression that we can translate it to.
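As a rough illustration of that distinction, here is a minimal sketch that turns on the UDF compiler and compares a straight-line UDF with one that contains a loop. It assumes the RAPIDS Accelerator jar is on the classpath and a GPU is available; the object name, the toy data, and both UDF bodies are made up for this example. Whether a given UDF is actually translated depends on the supported-operations list linked above, so check the plan output rather than relying on this sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfCompilerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-compiler-sketch")
      // Load the RAPIDS Accelerator plugin (its jar must be on the classpath).
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      // The Scala UDF -> Catalyst compiler is off by default.
      .config("spark.rapids.sql.udfCompiler.enabled", "true")
      // Log the reasons for anything that stays on the CPU.
      .config("spark.rapids.sql.explain", "NOT_ON_GPU")
      .getOrCreate()
    import spark.implicits._

    // Toy single-column dataframe (illustrative data only).
    val df = Seq(1.0, 2.0, 3.0).toDF("colA")

    // Straight-line arithmetic: a candidate for translation to Catalyst.
    val simpleUdf = udf((x: Double) => x * 2.0 + 1.0)

    // Contains a loop: expected to remain an opaque UDF on the CPU.
    val loopUdf = udf((x: Double) => {
      var acc = 0.0
      var i = 0
      while (i < 10) { acc += x * i; i += 1 }
      acc
    })

    // Compare the physical plans to see which projection runs on the GPU.
    df.withColumn("colB", simpleUdf(col("colA"))).explain()
    df.withColumn("colB", loopUdf(col("colA"))).explain()

    spark.stop()
  }
}
```

If the first plan shows GPU operators for the projection and the second shows a CPU fallback around the UDF, that matches the behaviour described above.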

It all depends on how heavy opaqueUdf is, and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it's doing many arithmetic or string operations successively on that column). I am not sure why you want to "merge all columns into one", so can you clarify why you want to do that? On the conversion to Array, is that the purpose of the UDF, or do you want to take in N columns -> perform some operation likely involving loops -> produce an Array?

Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are Java or Scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF document also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way. Please take a look there as well.
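To give a feel for the shape of such a UDF, here is a minimal sketch of a Scala function that also implements the RapidsUDF interface. The class name DoubleIt and its doubling logic are hypothetical, chosen only to keep the example short. The evaluateColumnar signature shown here (with a leading numRows parameter) is an assumption that matches newer plugin releases; I believe older releases take only the column arguments, so check the interface in the version you actually use.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}
import com.nvidia.spark.RapidsUDF

// Hypothetical example UDF: multiplies a Double column by 2.
class DoubleIt extends Function1[Double, Double] with RapidsUDF with Serializable {

  // Row-by-row CPU implementation, used when the plan is not on the GPU.
  override def apply(x: Double): Double = x * 2.0

  // Columnar GPU implementation: receives whole cuDF columns at once.
  // NOTE: the numRows parameter is an assumption about the plugin version.
  override def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = {
    require(args.length == 1, s"Unexpected argument count: ${args.length}")
    val input = args.head
    require(input.getType == DType.FLOAT64, s"Unexpected type: ${input.getType}")

    // cuDF scalars hold native memory and must be closed explicitly.
    val two = Scalar.fromDouble(2.0)
    try {
      input.mul(two) // element-wise multiply; the caller owns the result
    } finally {
      two.close()
    }
  }
}
```

Under these assumptions, the question's one-column case would become something like val opaqueUdf = udf(new DoubleIt()) followed by df.withColumn("colB", opaqueUdf(col("colA"))), with the loop logic rewritten in terms of cuDF column operations inside evaluateColumnar.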