Search code examples
hadoopdistributed-computingapache-nifi

QueryDatabaseTable Nifi Processor fetches duplicate rows from mysql database


QueryDatabaseTable is fetching rows from Mysql table twice on a 2 node cluster. When Execution setting is configured to all nodes, the fetching process itself is not distributed. Each node fetch similar data which isn't the ideal output I need. However changing it to primary node, it works fine, but a single node is burdened the whole process of fetching data which defeats the point of distributed computing. Is there a workaround to this ?


Solution

  • QueryDatabaseTable is designed to run on the Primary Node only with one task only, it does a single fetch and is not a distributed solution. For parallel/distributed fetching in a cluster, you will want the following:

    GenerateTableFetch -> RemoteProcessGroup -> Input Port -> ExecuteSQL

    GenerateTableFetch should be set to execute on the Primary Node only with one task. It does the "first half" of what QueryDatabaseTable does, by generating SQL statements to grab batches of rows of a specified size. But it does not do the fetch.

    Instead you send the SQL statements to a RemoteProcessGroup (RPG) which points to an Input Port on the same cluster. This will distribute the SQL statements among the nodes in the NiFi cluster.

    The ExecuteSQL processor on each node will get a subset of the SQL statements to execute, thereby performing the fetch in parallel across the cluster. Note that the remainder of this flow will execute in parallel, you won't be able to join up the results later, but it sounds like that's not what you want anyway.