Search code examples
mysqlsqlmergeintersectbooleanquery

Boolean operations on mysql results


I have 3 Mysql tables:

[block_value]

  • id_block_value
  • file_id

[metadata]

  • id_metadata
  • metadata_name

[metadata_value]

  • meta_id
  • value
  • blockvalue_id

In these tables, there are pairs: metadata_name = value And list of pairs are put in blocks (id_block_value)

(A) If I want height = 1080:

SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "height" and value = "1080");

+---------+
| file_id |
+---------+
|      21 | 
|      22 |
(...)
|    6962 |
(...)
|    8146 | 
|    8147 | 
+---------+
794 rows in set (0.06 sec)

(B) If I want file extension = mpeg:

SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "file extension" and value = "mpeg");

+---------+
| file_id |
+---------+
|    6889 | 
|    6898 | 
|    6962 | 
+---------+
3 rows in set (0.06 sec)

BUT, if I want:

  • A and B
  • A or B
  • A and not B

Then, I don't know what is the best.

For A or B, I tried A union B which seems to do the trick.

SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "height" and value = "1080")
UNION
SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "file extension" and value = "mpeg");
+---------+
| file_id |
+---------+
|      21 | 
|      22 | 
|      34 |
(...)
|    6889 | 
|    6898 | 
+---------+
796 rows in set (0.13 sec)

For A and B, since there are no intersect in Mysql, I tried A and file_id in(B), but look at perfs (>4mn)...

SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "height" and value = "1080")
and file_id in(
SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "file extension" and value = "mpeg"));

+---------+
| file_id |
+---------+
|    6962 | 
+---------+
1 row in set (4 min 36.22 sec)

I tried B and file_id in(A) too, which is a lot better, but I will never know how which one to put first.

SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "file extension" and value = "mpeg")
and file_id in(
SELECT DISTINCT file_id 
FROM metadata_value MV 
     INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
     INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
WHERE (metadata_name = "height" and value = "1080"));

+---------+
| file_id |
+---------+
|    6962 | 
+---------+
1 row in set (0.75 sec)

So... what do I do now? Is there any better way for boolean operations? Any tip? Did I miss something?

EDIT: what data looks like:

This database contains a row in FILE table for each audio/video file inserted:

  • 10, /path/to/file.ts
  • 11, /path/to/file2.mpeg

There is a row in METADATA table for each potential information:

  • 301, height
  • 302, file extension

Then, a row in BLOCK table define a container:

  • 101, Video
  • 102, Audio
  • 104, General

A file can have several blocks of metadata, a BLOCK_VALUE table contains instances of BLOCKS:

  • 402, 101, 10 // Video 1
  • 403, 101, 10 // Video 2
  • 404, 101, 10 // Video 3
  • 405, 102, 10 // Audio
  • 406, 104, 10 // General

In this example, file 10 has 5 blocks: 3 Video (101) + 1 Audio (102) + 1 General (104)

Values are stored in METADATA_VALUE

  • 302, 406, "ts" // file extension, General
  • 301, 402, "1080" // height, Video 1
  • 301, 403, "720" // height, Video 2
  • 301, 404, "352" // height, Video 3

Solution

  • I'm opening a new post only to keep the "correct" solution tidy..

    Ok, sorry, it seemed that I was making the wrong assumption. I never thought about two blocks being defined exactly the same way.

    So, since I'm a copycat, and I like my getting the AND from OR solution (:P), I got to these two solutions..

    ORing: I like Chris's solution better...

    SELECT DISTINCT file_id 
      FROM metadata_value MV 
        INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
        INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
       WHERE (metadata_name = "height" and value = "1080") 
          OR (metadata_name = "file extension" and value = "mpeg")
    

    ANDing: I'll use your ORing version (the one with the UNION all

      SELECT FILE_ID FROM (
         SELECT DISTINCT 1, file_id 
                 FROM metadata_value MV 
           INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
           INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
                  WHERE (metadata_name = "height" and value = "1080")
         UNION ALL
         SELECT DISTINCT 2, file_id 
                 FROM metadata_value MV 
           INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
           INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
                  WHERE (metadata_name = "file extension" and value = "mpeg")
       ) IHATEAND
       GROUP BY FILE_ID
       HAVING COUNT(1)>1
    

    Which gives:

    +---------+
    | FILE_ID |
    +---------+
    |    6962 |
    +---------+
    1 row in set (0.24 sec)
    

    it should be a little less fast than the ORing seeing the performances you pasted and mines (I am 3 times as slow, time to upgrade -.-), but still significantly faster than the previous queries ;)

    Anyway, how does the ANDing work? Put pretty simply, it just does the two separate queries and names the records according to the branch they come from, then counts the different file ids coming from them

    UPDATE: another way of doing it without having to "name" the branches:

    SELECT FILE_ID FROM (
        SELECT file_id 
            FROM metadata_value MV 
            INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
            INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
                WHERE (metadata_name = "height" and value = "1080")
        GROUP BY FILE_ID
        UNION ALL
        SELECT file_id 
            FROM metadata_value MV 
            INNER JOIN metadata M ON MV.meta_id = M.id_metadata 
            INNER JOIN block_value BV ON MV.blockvalue_id = BV.id_block_value 
        WHERE (metadata_name = "file extension" and value = "mpeg")
        GROUP BY FILE_ID
        ) IHATEAND
    GROUP BY FILE_ID
    HAVING COUNT(1)>1
    

    Here the results are the same (and performances as well) and I'm exploiting the fact that while UNION automatically sorts the duplicates and removes the duplicates, UNION ALL does not... which is perfect since I don't want them removed (and in general union all is also faster than union :) ), this way I can forget about naming.