Search code examples
ignite

Best practice to do join in Apache Ignite


I have two large tables A and B, and I want to join these two tables on two columns, say, project_id and customer_id.

When I do the join in Apache Ignite, I find that the performance is very bad. After investigating, I think the problem lies in that the data recide in different nodes randomly.

When the join happens, there are data transfer between nodes to make the same project_id and customer_id from A and B into same node.

For my case,

  1. Load data into the Ignite cluster based on A and B's project_id and customer_id, so that, there is no data transfer when doing the join. The solution can work but not flexible.
  2. Use only one node to hold all the data. This solution can work but there is memory limit for a single node(Not too much data can be held by one node)

I would ask which solution would be a better choice, thanks!


Solution

  • The former (1.) is recommended. You should load the data in the fashion so that data for the same project_id and customer_id is on the same node in both tables.

    This is called affinity collocation and it is paramount to get right to have good performance of Ignite queries (and sometimes for them to work at all).

    Ignite will take care of affinity collocation for you if you setup it correctly, but there are a few caveats right away:

    • Affinity key has to be a part of primary key (not a value field)
    • Affinity key has to be single (so you have to choose between project_id and customer_id) or a composite type (a nested POJO with its own implications) or a synthetic value maybe?
    • There is possibility of uneven partition distribution. Imagine you have a single large customer (or project). When processing this customer, all nodes but a single one will be idle and unused.