SQLite multiple JOIN issue between two tables and then two databases

I am writing an open source pure c program that checks the integrity of files using check-sums, comparing two structurally identical SQLite databases db1 and db2 from different sources.

Each database has two tables: the first one "files" contains the relative file path "relative_path", a check-sum "sha", and the path prefix identifier "path_prefix_index".

CREATE TABLE "files" (
    "ID"    INTEGER NOT NULL,
    "path_prefix_index" INTEGER NOT NULL,
    "relative_path" TEXT NOT NULL,
    "sha"   BLOB DEFAULT NULL,
    PRIMARY KEY("ID"),
    CONSTRAINT "full_path" UNIQUE("path_prefix_index","relative_path") ON CONFLICT FAIL
);
CREATE INDEX full_path_ASC ON files (path_prefix_index, relative_path ASC);

The second table "paths" contains the path prefixes "path" and its ID.

CREATE TABLE "paths" (
    "ID"    INTEGER NOT NULL UNIQUE,
    "path"  TEXT NOT NULL UNIQUE,
    PRIMARY KEY("ID")
);

As could be guessed (based on CONSTRAINT and the index), a unique pair of absolute file paths is composed by the pair paths.path and files.relative_path.

Let's populate the tables in the first database db1 with test data:

INSERT INTO files VALUES(1,1,'AAA/BCB/CCC/a.txt',X'f90d');
INSERT INTO files VALUES(2,1,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(3,1,'AAA/ZAW/D/e/f/b_file.txt',X'856a');
INSERT INTO files VALUES(4,2,'AAA/BCB/CCC/a.txt',X'856a');
INSERT INTO files VALUES(5,2,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(6,2,'AAA/ZAW/D/e/f/b_file.txt',X'856a');

INSERT INTO paths VALUES(1,'/mnt/path1');
INSERT INTO paths VALUES(2,'/mnt/path2');

After that let’s fill in the second database db2 with the same data, but having different indexes. It is necessary to pay attention to the fact that path prefixes in the "paths" table have different indexes and the corresponding "path_prefix_index" fields in the "files" table. Despite the difference in indexes, exactly the same unique pairs of prefix and relative path are formed, which create the absolute path to the file. The both tables in the examples contain the same files (when consider the absolute path point of view).

Now comes the part to notice. In the example above, there is one file for which "sha" checksum differs between databases.

INSERT INTO paths VALUES(4,'/mnt/path1');
INSERT INTO paths VALUES(3,'/mnt/path2');

INSERT INTO files VALUES(1,4,'AAA/BCB/CCC/a.txt',X'f90d');
INSERT INTO files VALUES(2,4,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(3,4,'AAA/ZAW/D/e/f/b_file.txt',X'856a');
INSERT INTO files VALUES(4,3,'AAA/BCB/CCC/a.txt',X'846a'); -- The exactly same file that has a different sha;
INSERT INTO files VALUES(5,3,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(6,3,'AAA/ZAW/D/e/f/b_file.txt',X'856a');

I need to create a query, allowing SQLite to display a pair of path and path_prefix_index (in total, the absolute path to the file) for exactly the same file for which sha field does not match the two databases.

For a long time I have been trying to cope with the multiple JOIN issue, but my knowledge of SQL does not allow to achieve the required result. Here is an example of monsters that I managed to build. Unfortunately this doesn't work:

sqlite> attach database "example1.db" as db1;
sqlite> attach database "example2.db" as db2;
sqlite> SELECT p.path, f1.relative_path
FROM db1.files AS f1
JOIN db1.paths AS p ON f1.path_prefix_index = p.ID
JOIN db2.files AS f2 ON f1.relative_path = f2.relative_path
JOIN db2.paths AS p2 ON f2.path_prefix_index = p2.ID
WHERE f1.sha <> f2.sha;

/mnt/path1|AAA/BCB/CCC/a.txt
/mnt/path2|AAA/BCB/CCC/a.txt
/mnt/path2|AAA/BCB/CCC/a.txt

Everything looks almost great, but it still doesn’t work correctly. In this example there is only one unique file that has a different sha and it is

/mnt/path2|AAA/BCB/CCC/a.txt

I ask to help to understand and build up an SQL query, as a result of which the app comparing two databases will be able to print out the desired absolute path to the file.

Solution

If you add * to show ALL columns e.g. SELECT p.path, f1.relative_path,* then you will see that there are multiple rows that match the WHERE clause:-

e.g.

see below re SQLite tool for the hex data
Remove the WHERE clause and there are 12 rows i.e. you are dealing with the cartesian product of the joined rows.

Assuming that by file you mean a single relative path then you could reduce the 3 to one by adding a GROUP BY f1.relative_path1

e.g.

SELECT p.path, f1.relative_path
FROM db1.files AS f1
JOIN db1.paths AS p ON f1.path_prefix_index = p.ID
JOIN db2.files AS f2 ON f1.relative_path = f2.relative_path
JOIN db2.paths AS p2 ON f2.path_prefix_index = p2.ID
WHERE f1.sha <> f2.sha
GROUP BY f1.relative_path1;

BUT is it

/mnt/path1 .... /mnt/path2, or
/mnt/path2 .... /mnt/path2, or
/mnt/path2 .... /mnt/path1

that you want, if that matters ?

note that GROUP BY may, depending upon the rules, return ambiguous values from with the groupings see https://www.sqlite.org/lang_select.html#resultset

Suggestion

I would suggest considering using an SQLite tool (SQliteStudion, Navicat for SQLite, DBeaver .... ) as it can makes playing around with the SQL much easier.

For instance for the above (and more) Navicat was used along with a single database to remove the complexity of attach (that's working fine).

Here's one permutation of the testing code:-

DROP TABLE IF EXISTS files;
DROP TABLE IF EXISTS paths;
DROP TABLE IF EXISTS files2;
DROP TABLE IF EXISTS paths2;
CREATE TABLE IF NOT EXISTS "files" (
    "ID"    INTEGER NOT NULL,
    "path_prefix_index" INTEGER NOT NULL,
    "relative_path" TEXT NOT NULL,
    "sha"   BLOB DEFAULT NULL,
    PRIMARY KEY("ID"),
    CONSTRAINT "full_path" UNIQUE("path_prefix_index","relative_path") ON CONFLICT FAIL
);
CREATE INDEX IF NOT EXISTS full_path_ASC ON files (path_prefix_index, relative_path ASC);
CREATE TABLE "paths" (
    "ID"    INTEGER NOT NULL UNIQUE,
    "path"  TEXT NOT NULL UNIQUE,
    PRIMARY KEY("ID")
);
CREATE TABLE IF NOT EXISTS "files2" (
    "ID"    INTEGER NOT NULL,
    "path_prefix_index" INTEGER NOT NULL,
    "relative_path" TEXT NOT NULL,
    "sha"   BLOB DEFAULT NULL,
    PRIMARY KEY("ID"),
    CONSTRAINT "full_path" UNIQUE("path_prefix_index","relative_path") ON CONFLICT FAIL
);
CREATE INDEX IF NOT EXISTS full_path_ASC ON files2 (path_prefix_index, relative_path ASC);
CREATE TABLE "paths2" (
    "ID"    INTEGER NOT NULL UNIQUE,
    "path"  TEXT NOT NULL UNIQUE,
    PRIMARY KEY("ID")
);

INSERT INTO files VALUES(1,1,'AAA/BCB/CCC/a.txt',X'f90d');
INSERT INTO files VALUES(2,1,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(3,1,'AAA/ZAW/D/e/f/b_file.txt',X'856a');
INSERT INTO files VALUES(4,2,'AAA/BCB/CCC/a.txt',X'856a');
INSERT INTO files VALUES(5,2,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files VALUES(6,2,'AAA/ZAW/D/e/f/b_file.txt',X'856a');

INSERT INTO paths VALUES(1,'/mnt/path1');
INSERT INTO paths VALUES(2,'/mnt/path2');

INSERT INTO paths2 VALUES(4,'/mnt/path1');
INSERT INTO paths2 VALUES(3,'/mnt/path2');

INSERT INTO files2 VALUES(1,4,'AAA/BCB/CCC/a.txt',X'f90d');
INSERT INTO files2 VALUES(2,4,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files2 VALUES(3,4,'AAA/ZAW/D/e/f/b_file.txt',X'856a');
INSERT INTO files2 VALUES(4,3,'AAA/BCB/CCC/a.txt',X'846a'); -- The exactly same file that has a different sha;
INSERT INTO files2 VALUES(5,3,'AAA/ZAW/A/b/c/a_file.txt',X'16b7');
INSERT INTO files2 VALUES(6,3,'AAA/ZAW/D/e/f/b_file.txt',X'856a');

SELECT p.path, f1.relative_path,f1.sha <> f2.sha AS comprslt, hex(f1.sha) AS h1, hex(f2.sha) AS h2, *  /* to show ALL data of rows that match the criteria */
FROM files AS f1
JOIN paths AS p ON f1.path_prefix_index = p.ID
JOIN files2 AS f2 ON f1.relative_path = f2.relative_path
JOIN paths2 AS p2 ON f2.path_prefix_index = p2.ID
WHERE f1.sha <> f2.sha
/*GROUP BY f1.relative_path*/;

DROP TABLE IF EXISTS files;
DROP TABLE IF EXISTS paths;
DROP TABLE IF EXISTS files2;
DROP TABLE IF EXISTS paths2;

Much copied and pasted, but tables names altered to suit the single DB.
Various columns added so that it is easier to understand what is what.
The GROUP BY clause commented out (simply uncomment)

The above produces:-

the highlighted columns added to see what is going on in regards to the comparison (WHERE clause). Often tools don't handle blobs that well.
- so the result of the comparison itself (obviously would always be 1 if the WHERE clause is used)
- the builtin hex function used to display the blob data in a readable/useful format.
as can be seen it's pretty easy to comment/uncomment parts of the SQL
the final DROPS used to clear up the environment

Additional Re Comment

unfortunately, your example shows the wrong result: /mnt/path1|AAA/BCB/CCC/a.txt But there must be /mnt/path2|AAA/BCB/CCC/a.txt Because only this pair /mnt/path2|AAA/BCB/CCC/a.txt forms the absolute and only path whose sha fields actually do not match.

Perhaps you want the WHERE clause to be WHERE f1.sha <> f2.sha AND p.path = p2.path

You then get:-

or with the additional output columns:-

Alternative CTE approach

If the primary joins between files? and paths? could be done first then the secondary join between the two sets of joins could itself be the equivalent of the WHERE clause.

CTE's can cater for this relatively easily.

A CTE (Common Table Expression) is a temporary table that exists only for the duration of the execution of the statement. You can have multiple CTE's per statement (see https://www.sqlite.org/lang_with.html for more in-depth information).

So an alternative could be to utilise 2 CTEs to do the primary joins and then SELECT from both joined appropriately.

e.g. (using the tables as above)

WITH
    ctedb1 AS (SELECT * FROM files JOIN paths ON path_prefix_index = paths.ID),
    ctedb2 AS (SELECT * FROM files2 JOIN paths2 ON path_prefix_index = paths2.ID)
SELECT 
    ctedb1.path,ctedb1.relative_path,hex(ctedb1.sha) AS sha,ctedb2.path AS otherpath, ctedb2.relative_path AS otherrelativepath, hex(ctedb2.sha) AS othersha
FROM ctedb1
JOIN ctedb2 ON ctedb1.path||ctedb1.relative_path = ctedb2.path||ctedb2.relative_path AND ctedb1.sha <> ctedb2.sha
;

the 2 columns for both sides of the JOIN (secondary) included and the two sha columns (as HEX) (not that they need to be)
note how the secondary join (joining the original 2 ctes) is effectively inclusive of the WHERE clause.

The result being:-