Search code examples
mysqlsqlmariadbdatabase-normalization

How to normalize/flatten a set attribute to an intersection table using SQL?


I was given the task to improve an old database. I can work with phpMyAdmin that connects to a MariaDB.

There's a table Region with a column called Super. The Super attribute stores the m:m-relationship to a table called Subcontinents. Here's the catch: instead of using an intersection table, Super is of the type Set/Enum and contains all subcontinents associated with a given region. The allowed values are hardcoded in the table definition and are not linked with a foreign key.

Regions Definition

Name Type
Id int(11)
Name text
Super set('1', '2', ...) = Ids of Subcontinents table

Regions Example

Id Name Super
123 Atlas 17
456 Europe 8,9,10,11

Subcontinents Definition

Name Type
Id int(11)
Subcontinent text

Subcontinents Example

Id Subcontinent
8 Northern Europe
9 Eastern Europe
10 Southern Europe
11 Western Europe
17 Eastern Africa

What I want to do now, is to create an intersection table between the two tables. I could not find out, how I can flatten the records with multiple Super values. The desired output would be something similar to this:

RegionId SubcontinentId
123 17
456 8
456 9
456 10
456 11

I tried to query the Super attribute like SELECT id, super, (SELECT * FROM super) as target FROM Region but apparently that's invlalid syntax. I also tried to map the set value to an integer, but I didn't now how to proceed from there either. Searching the internet brought up a lot of material about normalizing databases, sadly none of it contained an example with a set.

PS: I know how to create tables, move data between them and add constraints.


Solution

  • Apparently there's a function FIND_IN_SET which helps with this problem. The syntax is:

    FIND_IN_SET(pattern, strlist)
    

    One can use it as a condition in a join:

    SELECT r.id, r.super, s.id
      FROM Region as r
      JOIN Subcontinents as s
        ON FIND_IN_SET(s.id, r.super)
    

    Which results in:

    r.id r.super s.id
    456 8,9,10,11 8
    456 8,9,10,11 9
    456 8,9,10,11 10
    456 8,9,10,11 11

    Please be aware of the pitfalls @Akina pointed out in the comments:

    • There're no sets, they're in fact CSV value packs.
    • FIND_IN_SET is a string function.
    • Spaces matter. 8,9,10 is not the same as 8, 9, 10. In the latter, 8 would be found by FIND_IN_SET, but 9 (missing the leading space) won't.
    • Performance is bad.