I am learning about join algorithms in regards to relational data query processing. The simple case is the nested loop join:
function nestedJoin(R, S, compare) {
const out = []
for (const r of R) {
for (const s of S) {
if (compare(r, s)) {
out.push([ r, s ])
}
}
}
return out
}
Where compare
would compare the join attribute.
The case I'm wondering about is the index join. Copying from that cheat sheet into JS sort of, we have:
function indexJoin(R, S) {
const out = []
for (const r of R) {
const X = findInIndex(S.C, r.c)
for (const s of X) {
out.push([ r, s ])
}
}
return out
}
But what is that findInIndex(S.C, r.c)
? What is being passed into it (S.C
)? And how does it work?
The join indices paper says this:
With no indices, two basic algorithms based on sorting [4] and hashing [5, lo] avoid the prohibitive cost of the nested loop method.
A join index is a binary relation. It only contains pairs of surrogates which makes ( it small. However, for generality, we assume that it does not always fit in RAM. Therefore, a join index must be clustered. Since we may need fast access to JI tuples via either r values or s values depending on whether there are selects on relations R or S, a JI should be clustered on (r, s). A simple and uniform solution is to maintain two copies of the JI, one clustered on r and the other clustered on s. Each copy is implemented by a W-tree an efficient variation of the versatile B-tree [l, 71.
So if it were a B+tree, what would the key and value be used in each B+tree, and how would you use the keys and values (in what order do you plugin keys and get values)? Also, cannot the "join index" just be implemented something like this if it were in JavaScript?
const joinIndex = {}
function join(r, s) {
const rest = joinIndex[r.c] = joinIndex[r.c] ?? {}
rest[s.c] = true
}
function findInIndex(leftKey) {
return Object.keys(joinIndex[leftKey])
}
Please show how the join algorithm would be implemented, either with my approach or the B+tree approach. If it is the B+tree approach, you don't need to implement a B+tree, but just explain how you would plug things in/out of the 2 B+trees to explain the algorithm with more clarity.
First of all, the join index that the paper speaks of, can best be imagined as a table that implements a many-to-many relationship between two tables. A record in this join index table consists of two foreign keys: one referencing the primary key in the R table, and another referencing the primary key in the S table.
I didn't get the S.C
notation used in the cheat sheet. But it is clear you'll somehow need to specify which join index to use, and more specifically, which B+Tree (clustering) you want to use on it (in case two of them are defined), and finally, which value (r.c
, the key of r) you want to find in it.
The role of the B+tree is to provide an ordered hash table, i.e. where you can search a key efficiently, and can easily walk from that point to the subsequent entries in order. In this particular use for a join index, this allows you to efficiently find all pairs (r1, s) for a given r1. The first of those would be found by drilling down from the root of the B+tree to the first leaf having r1. Then a walk forward across the bottom layer of the B+tree would find all the other tuples with r1, until a tuple is encountered that no longer has this r1 value.
Note that you still need an index on the original tables as well, in order to find the complete record for a given key. In practice that could also be done with a B+Tree, but in JavaScript, a simple dictionary (plain object) would suffice.
So in JavaScript syntax we could imagine something like this:
// Arguments:
// - joinIndexTree: a B+Tree having (rKey, sKey) tuples, keyed and ordered by rKey.
// - rKey: the key to search all matches for
function findInIndex(joinIndexTree, rKey) {
let result = []; // This will collect all the sKey for
// which thee is a (rKey, sKey)
// Find left-most location in B+Tree where rKey should occur (if present)
let btreeCursor = joinIndexTree.find(rKey);
if (btreeCursor.EOF()) return result; // At the far-right end of the B+Tree
let tuple = btreeCursor.get(); // Read the tuple found at this location
while (tuple[0] == rKey) { // First member of tuple matches rKey
result.push(tuple[1]); // Collect the corresponding s-value
btreeCursor.next(); // Walk to the next tuple
if (btreeCursor.EOF()) break; // At the end of the B+Tree
tuple = btreeCursor.get(); // Read the tuple found at this location
}
return result;
}
The main program would be:
const joinIndexTree = ;// B+Tree with (rKey, sKey) pairs, ordered by rKey
const sIndex = Object.fromEntries(S.map(s => [s.id, s])); // dictionary
function indexJoin(joinIndexTree, R, S, sIndex) {
const out = []
for (const r of R) {
const sids = findInIndex(joinIndexTree, r.id)
for (const s_id of sids) {
const s = sIndex[s_id]; // Look up by primary key index
out.push([ r, s ])
}
}
return out
}
When you only need read-only operations on the table (queries), then instead of a B+Tree, you can create a dictionary of arrays, where you can lookup by joinIndex[r.id]
and get an array of s.id
values. This is certainly easy to set up and work with, but it is a pain to keep updated when the tables are not read-only.
As alternative to B+Tree, you can also use other balanced search trees, such as AVL and red-black trees, but in my experience B+Trees have superior performance.