Search code examples
relational-databasedatabase-normalizationreferential-integrity

How to reference groups of records in relational databases?


Humans

| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
|       1 | Issac     | Newton   | M      |
|       2 | Marie     | Curie    | F      |
|       3 | Tim       | Duncan   | M      |

Animals

| AmimalID | Species | NickName |
|----------+---------+----------|
|        4 | Tiger   | Ronnie   |
|        5 | Dog     | Snoopy   |
|        6 | Dog     | Bear     |
|        7 | Cat     | Sleepy   |

How do I reference a group of records in other tables?

For example:

Foods

| FoodID | FoodName | EatenBy |
|--------+----------+---------|
|      8 | Rice     | ???     |

What I want to store in EatenBy may be:

  1. a single record in the Humans and Animals tables (e.g. Tim Ducan)
  2. a group of records in a table (e.g. all dogs, all males, all females)
  3. a whole table (e.g. all humans)

A simple solution is to use a concatenated string, which includes primary keys from different tables and special strings such as 'Humans' and 'M'. The application could parse the concatenated string.

Foods

| FoodID | FoodName | EatenBy      |
|--------+----------+--------------|
|      8 | Rice     | Humans, 6, 7 |

Using a concatenated string is a bad idea from the perspective of relational database design.

Another option is to add another table and use a foreign key.

Foods

| FoodID | FoodName |
|--------+----------|
|      8 | Rice     |

EatenBy

| FoodID | EatenBy |
|--------+---------|
|      8 |  Humans |
|      8 |       6 |
|      8 |       7 |

It's better than the first solution. The problem is that the EatenBy field stores values of different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?

The example tables here are a bit contrived, but I do run into situations like this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.


Solution

    1. This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates. There is a series of matching Responses.

    2. The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.

    3. The Answer exceeds the maximum length exceeded. I provide them as links in order to overcome that.

    Response to Initial Question

    You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 2NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one, you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of the 1970's ISAM data processing.

    1. The problem is, that you seem to want a relational database. Perhaps you have heard of the data integrity, the relational power (Join power for you, at this stage), and speed. A Record Filing System has none of that.

      If you want a Relational database, then you are going to have to:

      • think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).

      • Then classifying the data (no relation whatever to the OO class or classifier concept).

      • Then relating the classified data.

    2. The second problem is, and this is typical of OO types, they concentrate on, obsess on, the data values, rather than on the meaning of the data; how it is classified; how it relates to other data; etc.

      No question, you did not think that concept up yourself, your "teachers" fed it to you, I see it all the time. And they love the Record Filing Systems. Notice, instead of giving table definitions, you state that you give "structure", but instead you list data values.

      • In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.

      • I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.

      • Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.

      • The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".

    Therefore, I would ask you to:

    • forget about "table structures" or definitions

    • forget about what you want

    • forget about implementation options

    • forget ID columns, completely and totally

    • forget EatenBy

    • think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it

    • think about how that data is classified, and how it can be classified.

    • how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)

    If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:

    • Things

    • Groups of Things

    • Relationships between Things and ThingGroups

    The Things are nouns, subjects. Eventually we will be doing something between those subjects, that will be verbs or action statements. That will form Predicates (First Order Logic). But not now, for now, we want the only the Things.

    Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.


    Response to Update 1 re Hierarchy

    Record IDs are Physical, Non-relational

    If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.

    Please read this Answer.

    Hierarchies Exist in the Data

    If you want a full discourse, please ask a new question. Here is a quick summary.

    Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.

    The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.

    Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.

    There are three types of Hierarchies that occur in data.

    1. Hierarchy Formed in Sequence of Tables

      This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database ad produces a Record Filing System, with none of the integrity, power or speed of a Relational Database.

      The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.

      The Hidders Response has a great example of Hierarchies:

      a. that exist naturally in the data

      b. that OO types are blind to [as Hidders evidently is]

      c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.

      Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.

      The two hierarchies in those four tables are:

          Domain::Animal::Harvest
      
          Domain::Activity::Harvest
      

      Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.

      Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:

          Subject::CategorySubject::ExaminationResult
      
          Category::CategorySubject::ExaminationResult
      
          Person::Registrant::Candidate::ExaminationResult
      

      Note that last one is a progression of state of the business instrument, thus the Key does not compound.

    2. Hierarchy of Rows within One Table

      Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).

      • The terms ancestor and descendant use here are plain technical terms, they do not have the OO connotations and limitations.

      You do need recursion in the server, in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.

      Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:

          Message::Message[Message]::Message[::Message[Message]] ...
      
    3. Hierarchy of Rows within One Table, Via an Associative Table

      This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.

      The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.

      Here is the Bill of Materials hierarchy, modelled and implemented correctly.

      • Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.

      • Person::Progeny is also given.

      • The hierarchies are:

        Part[Assembly]::Part[Component] ...
        
        Part[Component]::Part[Assembly] ...
        
        Person[Parent]::Person[Child] ...
        
        Person[Child]::Person[Parent] ...
        

    Ignorance Of Hierarchy

    Separate to the fact that hierarchies commonly exist in the data, that they are not recognised as such, due to the suppression, and that therefore they are not implemented as hierarchies, when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.

    • Adjacency List

      The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial that it is founded on the Hierarchical Model (each of which provides plain evidence that they are ignorant of the basic concepts in the RM, which they allege to be postulating about). So they can't use the name. This is the stupid name they use.

      Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc, absent of Relational Integrity, etc.

      And they are clueless as to how to traverse the tree, that one needs recursion.

    • Nested Sets

      An abortion, straight from hell. A Record Filing System within a Record Filing system. Not only does this generate masses of duplication and break Normalisation rules, this fixes the records in the filing system in concrete.

      Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.

      The MS HIERARCHYID Datatype does the same thing. Gives you a mass of concrete that has to be jack-hammered and poured again, every time a node changes.

    Ok, it wasn't so short.

    Response to Update 2

    Response to Update 3

    Response to Update 4