
Does the SortValues transform (Beam Java SDK extension) only run in a Hadoop environment?


I tried the example code for the SortValues transform using the DirectRunner on my local machine (Windows):

PCollection<KV<String, KV<String, Integer>>> input = ...

// Group by the primary key.
PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped =
    input.apply(GroupByKey.<String, KV<String, Integer>>create());

// Sort each group's values by the secondary key.
PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
    grouped.apply(SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));

but I got the error PipelineExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable. Does this mean the transform only works in a Hadoop environment?


Solution

  • A NoClassDefFoundError means the Hadoop classes are not on the runtime classpath; it does not by itself mean a Hadoop cluster is needed. As of this writing, if you use a Beam release below 2.0.0, you have to add two Hadoop dependencies to your Maven pom file for the SortValues module to work:

    1. hadoop-common, version 2.7.3 or later
    2. hadoop-mapreduce-client-core, version 2.7.3 or later

    Otherwise, just use a Beam release >= 2.0.0.
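For the pre-2.0.0 case, the two dependencies named above could be declared roughly as follows. This is only a sketch: the groupId and artifactIds are the usual Hadoop coordinates on Maven Central, 2.7.3 is simply the minimum version mentioned above, and the fragment assumes it sits inside the existing <dependencies> section of your pom.xml. Your project may need a different version or additional exclusions.

```xml
<!-- Sketch: Hadoop libraries needed by the Beam sorter extension on Beam < 2.0.0 -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.7.3</version>
</dependency>
```

After adding them, rebuild so that org.apache.hadoop.io.Writable is on the classpath when the DirectRunner executes the pipeline.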