I am new to RAPIDS and am having trouble understanding which operations are supported.
I have data in the following format:
+------------+----------+
|        kmer|source_seq|
+------------+----------+
|TGTCGGTTTAA$|         4|
|ACCACCACCAC$|         8|
|GCATAATTTCC$|         1|
|CCGTCAAAGCG$|         7|
|CCGTCCCGTGG$|         6|
|GCGCTGTTATG$|         2|
|GAGCATAGGTG$|         5|
|CGGCGGATTCT$|         0|
|GGCGCGAGGGT$|         3|
|CCACCACCAC$A|         8|
|CACCACCAC$AA|         8|
|CCCAAAAAAAAA|         0|
|AAGAAAAAAAAA|         5|
|AAGAAAAAAAAA|         0|
|TGTAAAAAAAAA|         0|
|CCACAAAAAAAA|         8|
|AGACAAAAAAAA|         7|
|CCCCAAAAAAAA|         0|
|CAAGAAAAAAAA|         5|
|TAAGAAAAAAAA|         0|
+------------+----------+
I am trying to find out which "kmer"s have which "source_seq"s, using the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list
val w = Window.partitionBy("kmer")
x.withColumn("source_seqs", collect_list("source_seq").over(w))
// Result is something like this:
+------------+----------+-----------+
|        kmer|source_seq|source_seqs|
+------------+----------+-----------+
|AAAACAAGACCA|         2|        [2]|
|AAAACAAGCAGC|         4|        [4]|
|AAAACCACGAGC|         3|        [3]|
|AAAACCGCCAAA|         7|        [7]|
|AAAACCGGTGTG|         1|        [1]|
|AAAACCTATATC|         5|        [5]|
|AAAACGACTTCT|         6|        [6]|
|AAAACGCGCAAG|         3|        [3]|
|AAAAGGCCTATT|         7|        [7]|
|AAAAGGCGTTCG|         3|        [3]|
|AAAAGGCTGTGA|         1|        [1]|
|AAAAGGTCTACC|         2|        [2]|
|AAAAGTCGAGCA|         7|     [7, 0]|
|AAAAGTCGAGCA|         0|     [7, 0]|
|AAAATCCGATCA|         0|        [0]|
|AAAATCGAGCGG|         0|        [0]|
|AAAATCGTTGAA|         7|        [7]|
|AAAATGGACAAG|         1|        [1]|
|AAAATTGCACCA|         3|        [3]|
|AAACACCGCCGT|         3|        [3]|
+------------+----------+-----------+
The Spark RAPIDS supported-operators documentation says that collect_list
is supported only in windowing, which, as far as I can tell, is what my code does.
However, the query plan shows that the collect_list window
is not executed on the GPU:
scala> x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain
== Physical Plan ==
Window [collect_list(source_seq#302L, 0, 0) windowspecdefinition(kmer#301, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_source#658], [kmer#301]
+- GpuColumnarToRow false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1496]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
In contrast, for a similar query that uses a different function (min), the plan shows the window being executed on the GPU:
scala> x.withColumn("min_source", min("source_seq").over(w)).explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [gpumin(source_seq#302L) gpuwindowspecdefinition(kmer#301, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS max_source#648L], [kmer#301], false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1431]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
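The difference between the two plans comes down to which operator names carry the Gpu prefix. As a rough sketch of that comparison, the helper below scans the text of a physical plan and returns the operators that do not start with "Gpu". The object name PlanCheck and the method cpuOperators are made up for illustration and are not part of Spark or the RAPIDS plugin, and the prefix check is only a heuristic based on the naming seen in the plans above.

```scala
object PlanCheck {
  // Matches leading tree decoration ("+- ", ":  "), an optional
  // whole-stage-codegen marker such as "*(1) ", then the operator name.
  private val Operator = """^[\s:+\-]*(?:\*\(\d+\)\s+)?(\w+)""".r

  /** Operator names in an explain-style plan string that lack the "Gpu" prefix. */
  def cpuOperators(plan: String): Seq[String] =
    plan.split("\n").toSeq
      .filterNot(_.trim.startsWith("=="))   // skip the "== Physical Plan ==" header
      .flatMap(line => Operator.findFirstMatchIn(line).map(_.group(1)))
      .filterNot(_.startsWith("Gpu"))       // heuristic: GPU operators are named Gpu*
}
```

In spark-shell this could be fed with something like df.queryExecution.executedPlan.toString; for the collect_list plan above it would report Window, and for the min plan it would report nothing.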
Am I misreading the supported-operations documentation, or have I written the code incorrectly? Any help would be appreciated.
Yes, as Mithun mentioned, spark.rapids.sql.expression.CollectList is true by default starting with the 0.5 release. In the 0.4 release it is false by default: https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/docs/configs.md
Here is the plan I got when testing on a 0.5+ version:
val w = Window.partitionBy("name")
val resultdf=dfread.withColumn("values", collect_list("value").over(w))
resultdf.explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [collect_list(value#134L, 0, 0) gpuwindowspecdefinition(name#133, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS values#138], [name#133], false
   +- GpuCoalesceBatches RequireSingleBatch
      +- GpuSort [name#133 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@28e73bd1
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(name#133, 200), ENSURE_REQUIREMENTS, [id=#563]
               +- GpuFileGpuScan csv [name#133,value#134L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/tmp/df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string,value:bigint>
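If upgrading is not an option, one thing that may be worth trying on 0.4 is turning the expression on explicitly. The snippet below is a sketch for spark-shell with the plugin already loaded: spark is the shell's SparkSession and x is the DataFrame from the question. I have not verified that 0.4 then actually places the window on the GPU, so treat that part as an assumption and check the plan. Setting spark.rapids.sql.explain to NOT_ON_GPU asks the plugin to log why an operator stays on the CPU.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// False by default in 0.4 per the linked configs.md; true by default from 0.5.
spark.conf.set("spark.rapids.sql.expression.CollectList", "true")
// Have the plugin report the reason whenever an operator is kept on the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

val w = Window.partitionBy("kmer")
// If the setting takes effect, the plan should show GpuWindow instead of Window.
x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain
```

The same settings can also be passed with --conf when launching spark-shell.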