database database-design relational-database database-schema

Is it reasonable to separate relation atributes with frequently duplicate string values?

Consider relation SPACESHIP_FLAT:

╔═════╤════════════════╤════════════════╤═══════════╗
║ ID  │ NAME           │ TYPE           │ LV        ║
╟─────┼────────────────┼────────────────┼───────────╢
║ 1   │ Soyuz TMA-14   │ Soyuz          │ Soyuz-FG  ║
║ 2   │ Endeavour      │ Space Shuttle  │ Shuttle   ║
║ 3   │ Soyuz TMA-15M  │ Soyuz          │ Soyuz-FG  ║
║ 4   │ Atlantis       │ Space Shuttle  │ Shuttle   ║
║ 5   │ Soyuz TM-31    │ Soyuz          │ Soyuz-U   ║
║ 6   │ EFT-1          │ Orion          │ Delta-IV  ║
║ 7   │ XXX            │ CST-100        │ Delta-IV  ║
║ 8   │ YYY            │ CST-100        │ Falcon-9  ║
║ 9   │ ZZZ            │ Dragon V2      │ Falcon-9  ║
║ ... │ ...            │ ...            │ ...       ║
╚═════╧════════════════╧════════════════╧═══════════╝

Here attributes TYPE and LV have duplicate string values. So my question is: is it reasonable to project these attributes to new relations? Result will be as shown below.

SPACESHIP

╔═════╤════════════════╤═══════╤══════╗
║ ID  │ NAME           │ TYPE  │ LV   ║
╟─────┼────────────────┼───────┼──────╢
║ 1   │ Soyuz TMA-14   │ 1     │ 1    ║
║ 2   │ Endeavour      │ 2     │ 2    ║
║ 3   │ Soyuz TMA-15M  │ 1     │ 1    ║
║ 4   │ Atlantis       │ 2     │ 2    ║
║ 5   │ Soyuz TM-31    │ 1     │ 3    ║
║ 6   │ EFT-1          │ 3     │ 4    ║
║ 7   │ XXX            │ 4     │ 4    ║
║ 8   │ YYY            │ 4     │ 5    ║
║ 9   │ ZZZ            │ 5     │ 5    ║
║ ... │ ...            │ ...   │ ...  ║
╚═════╧════════════════╧═══════╧══════╝

SSTYPES

╔═════╤════════════════╗
║ ID  │ TYPE           ║
╟─────┼────────────────╢
║ 1   │ Soyuz          ║
║ 2   │ Space Shuttle  ║
║ 3   │ Orion          ║
║ 4   │ CST-100        ║
║ 5   │ Dragon V2      ║
║ ... │ ...            ║
╚═════╧════════════════╝

LVEHICLES

╔═════╤═══════════╗
║ ID  │ LV        ║
╟─────┼───────────╢
║ 1   │ Soyuz-FG  ║
║ 2   │ Shuttle   ║
║ 3   │ Soyuz-U   ║
║ 4   │ Delta-IV  ║
║ 5   │ Falcon-9  ║
║ ... │ ...       ║
╚═════╧═══════════╝

I've normalized my relational model to 5NF and found no reason to separate TYPE and LV from the relation (these does not cause update anomalies). But in case of huge numbers of tuples in SPACESHIP_FLAT relation it will consume a lot of resources (as I suppose) - so separating them will be a bit more efficient. But I did not found it in theory of database design.

Attribute TYPE entirely depends on attribute NAME. NAME indicates an instance of class TYPE - one and more instances belong to one class. Soyuz TMA-14 is an instance of Soyuz ship series. So it can have one and only one value of TYPE. Relationship between two sets of values NAME and TYPE is many-to-one (no multivalued dependency).
Attribute LV depends on attribute NAMEin the same manner.
But if I eventually decide to clarify the Soyuz TYPE and set all the Soyuz TMA-XX spacecrafts as Soyuz TMA TYPE, Soyuz TM-XX as Soyuz TM and so on then I need to update each Soyuz* record no matter if I use the first variant with flat relation or variant with three different relations. The second one will ease editing value Soyuz (so it looks better in terms of data integrity).
As for surrogate keys - I use 'em because I need them to be represented in cross-reference relations (many-to-many, even SPACESHIP_FLAT or SPACESHIP- there are no top-level relations in my data model).

Solution

Preliminary

I don't know what books you are reading, but I do note that the modern books are nonsense. Your single table is a flat file.

You have two tables named SPACESHIP. In order to avoid confusion, I will call the single table version SPACESHIP_FLAT.

I've normalized my relational model to 5NF

Sorry, that isn't Normalised. Not even to 1NF.

If we consider 1NF and 2NF taken literally, by the authors who write books in the post-Codd era, sure, it "satisfies" 1NF and 2NF.
But if we consider that we do not need a long list of the exceptions, that we are not seeking to undermine the definition of 1NF and 2NF, that is we take it in spirit and intent, then your single file fails 1NF and 2NF.
Likewise, the Codd definition for 3NF is full and complete. But the newcomers need BCNF, 4NF, 5NF, and a dozen or so NFs that have not been written yet, to approach the level of Normalisation that the original definition considers as 3NF by spirit and intent.

It is not relational either, because it breaks a number of rules in the Relational Model.

You have surrogates only, Record IDs, which are an artefact of the pre-relational era, which most modern authors have regressed to, as it is the only thing they can understand. But they incorrectly label that as "relational".
Thus you have no Relational Integrity (that is distinct from Referential Integrity), power, or speed.
Record IDs and the like are expressly prohibited.

and found no reason to separate TYPE and LV from the relation

Sorry, it is not a relation either. It is the Universal Relation, a derived relation, a flattened view of all relations.

Note that the post-Codd authors do not (and can not) differentiate between base relations and derived relations, they attempt to "normalise" both, and squeal that "SQL can't ..." and "Relational Model doesn't ...".
The two biggest opponents of the original model are disagreeing with 1NF.
We need to Normalise and design base relations only.

these does not cause update anomalies.

Sorry, false again. That probably means you do not understand what an Update Anomaly is (as opposed to, your flat file does not have them).

There are a few different types of Update Anomaly. They can be defined as a group, by the following simple (that means simple, technical, not complete academic) definition.

An Update Anomaly is one where, when you need to update just one row, the tables are such that other rows (in the same or other table) must also be updated, in order to preserve integrity and Consistency of the data.

If I UPDATE the TYPE field in record ID 5 of SPACESHIP_FLAT file, from Soyuz to Proto-Soyuz, then the records 1 and 3 are no longer consistent, they have to be updated as well. That is an Update Anomaly.
If I approach the same update in the somewhat Normalised three-file version, I UPDATE the TYPE field in record ID 1 of SSTYPES file, from Soyuz to Proto-Soyuz, that change will be reflecting in the projections that use SSTYPES. No other records need to be updated. There is no Update Anomaly.
An Update Anomaly is a consequence of invalid or incomplete Normalisation. It is a flag, an alarm. It cannot be corrected in and of itself, you have to revert to the Normalisation stage, and correct it there.
NB. That is just one type of Update Anomaly. But they are all consequences of errors in the Normalisation exercise.

But in case of huge numbers of tuples in SPACESHIP relation it will consume a lot of resources (as I suppose) - so separating them will be a bit more efficient

Good point, definitely yes, but not for the right reasons. The right reasons and procedure result in separated tables which have Relational Integrity, are far more efficient, etc, etc, etc,

But I did not found it in theory of database design.

That is because Database Design and Relational Theory is not correctly taught. The books I have seen are horrible, and the Alice book is particularly bad.

Integrity

What you do not appreciate is, the data in the single file has no integrity. The data in the three-file combo has more integrity, but not the full set that the Relational Model provides.

Answer

So, the three-file cluster is more Normalised than the single file.

But it needs more Normalising, more Relationalisation, to be complete.

get rid of the Record IDs.
If you would like some detail re the horrors of Record IDs, read this Answer, from the top, to False Teachers, and the section of Lookup Tables.
Keep Normalising (you are doing a great job, understanding it intuitively), until you have:
no repeating data whatsoever
all the Facts (a database is a collection of Facts, about the real world) that you need
expressed as relational Keys

Then you will have a Relational Database.

Throw those books out. Read only Dr E F Codd. His paper is freely available, but the terminology is out-of-date, and it is seminal, the implications of certain terms are lost.

Response to Comment & Update

And trying to create logical model as optimal as possible.

Yes. So we will keep this question and answer at the logical model level. And when it is resolved, normalised, optimal (at the logical level only ), then you can proceed to the physical model.

In that case, be advised that ID fields are physical, not logical. ID fields do not exist in the data that is being considered (the domain of discourse). Modern books will mislead the reader, introduce IDs for everything, and other crimes against the science. Remove them and then we can start with the logical model. We cannot start when the data is infected, when it contains additional contaminants that do not exist in the pure uncontaminated data.

Attribute TYPE entirely depends on attribute NAME. NAME indicates an instance of class TYPE - one and more instances belong to one class. Soyuz TMA-14 is an instance of Soyuz ship series. So it can have one and only one value of TYPE.

(I will be using Relational terms, not OO/ORM terms. We don't have classes and instances, we have domains and tuples or rows.)

Relationship between two sets of values NAME and TYPE is many-to-one (no multivalued dependency).

Understood, and accepted. With one exception: your two paras mean NAME is dependent on TYPE, not the other way around (your opening sentence contradicts the rest).

Now in the Relational Model, we have degrees of Dependence.

Typical for Lookup tables that are referenced, and for all files in Record Filing Systems, where it cannot be said that the subject (referencing) table rows cannot exist unless the referenced table exists, the Dependency is Non-Identifying. The relationship line is dashed.
Typical of the great majority of tables in a Relational Database (minus those Lookup tables), where the subject table rows exist only in the context of the referenced table, the Dependency is Identifying. The parent PK is used to form the child PK. The relationship line is solid.

If SPACESHIP exists only in the context of TYPE, then SPACESHIP is (a) Dependent on TYPE, and (b) the TYPE PK is used to form the SPACESHIP PK.

I tend to think that SPACESHIP is Dependent on both TYPE and LAUNCH_VEHICLE, and TYPE and LAUNCH_VEHICLE each Identifies SPACESHIP.

Attribute LV depends on attribute NAME in the same manner.

No. NAME depends on LAUNCH_VEHICLE (LV), not the other way around.

But if I eventually decide to clarify the Soyuz TYPE and set all the Soyuz TMA-XX spacecrafts as Soyuz TMA TYPE, Soyuz TM-XX as Soyuz TM and so on then I need to update each Soyuz record no matter if I use the first variant with flat relation or variant with three different relations.*

That is exactly right.

That is exactly why I stated that your model is not in 1NF. Your "name" is in fact not a string of characters, it is made up of parts. And those parts have specific meaning (which Date cannot understand). Those parts are in fact (a) TYPE (b) LAUNCH_VEHICLE, and (c) something that has not been articulated yet. The TMA vs TM-XX has meaning and it is similar to TYPE, but a separate set.

I will let you work that (c) out. Notice the repetitions; figure out what each component part means; etc. Then extract that out into a separate table.

The true Names are Discovery, Endeavour, Volna, Yenesei, Lena, etc. The thing that is labelled NAME is a concatenation of formal classifiers (TYPE, LAUNCH_VEHICLE, and XXXX).

The second one will ease editing value Soyuz (so it looks better in terms of data integrity).

We don't care about looks in science. It is either right or wrong. A certain column either has data integrity or it doesn't. The first does not have any data integrity, the second has some data integrity (that which has been discussed, and there is more to be had).

As for surrogate keys - I use 'em because I need them to be represented in cross-reference relations (many-to-many, even SPACESHIP_FLAT or SPACESHIP- there are no top-level relations in my data model).

There is no such thing as a "surrogate key". It is simply a surrogate. It has none of the qualities that a Key has. The term "surrogate key" is therefore misleading, because one naturally expects some, if not all, the qualities of a Key, and it does not have any of those qualities.
Surrogates are physical, not logical. We don't need them at this logical stage.
We do not need surrogates in the Relational Model. Especially not for relating rows. The Record Filing Systems that preceded the Relational Model used Record IDs (physical, surrogates) to relate records. The big difference in the Relational Model, that gives it Relational Power, is that it uses logical Keys, not Record IDs (physical, surrogates), to relate rows.

Anyone can relate the entries in a spreadsheet, by numbers. It takes a bit more understanding to (a) give up the spreadsheet view of data, and (b) relate entries by Key.

One-to-one; one-to-many; many-to-many, using Keys, is no problem at all.

that book is C. J. Date. Introduction to Database Systems - contained more theory on database systems of all books I've managed to grab (in my native language).

So what: if it is bad, it's bad. If it furthers the understanding of the Relational Model, it furthers the understanding of the Relational Model.

The evidence, both here in this question, and in every project that followers of this style produce an anti-relational result. Here he has got you thinking that you are producing a Logical model that conforms to the Relational Model, but the evidence is, it is an anti-relational Physical filing system, with none of the capability, Integrity, power, or speed of a Relational Database, with pre-relational Record IDs.

If you want more proof, just look at my answers on Stack Overflow (go to my profile). Note that the seekers are genuine people, who, like you, are following those books, and creating monstrous filing systems, while believing then to be "relational" and "database". That is Date's real success.

These anti-relational books are heavily marketed. Now they are being used as textbooks in universities. So what. They are still wrong.

I repeat, throw out those books, and read only Dr E F Codd. And of course, genuine disciples of Codd. Not the people who reference Codd but teach the opposite.

C. J. Date said that 5NF will not make the relation free from all update anomalies (as you've shown on SPACESHIP_FLAT example).

Date and his colleagues disagree with Codd's work. During the Codd era, we had 1NF, 2NF, 3F. It was objective, and no one argued.

Then the era of mathematicians who improve the science came along. They allegedly found "holes" in 3NF. They wrote theses and mathematical definitions to "plug the holes" (of their own making). So now they have BCNF, 4NF, 5NF. Which, as far as guys like me are concerned, are superfluous nonsense.

They argue amongst themselves re what an NF is, or is not, about what a mathematical proof is, or is not. Right now Date and Darwen are trying to change the definition of 1NF, to suit their own purposes, after it has been defined and used by millions, for forty five years.

Most of them agree (as of today, but that could change tomorrow) that 5NF ensures no Update Anomalies.

I have several large banking databases that were honest 3NF, long before 5NF was hatched, I made written declarations that they did not have Update Anomalies. My customers had occasion, ten years later, after 5NF was hatched, to ask me to comply with 5NF, in order to ensure no Update Anomalies. I examined the old 3NF data model, and to everyone's surprise, it "satisfied" 5NF, and no surprise, it did not have any Update Anomalies.

How did that happen ?

By:

(a) being technically honest (ie. not trying to find ways of not-complying, while looking like I am complying) and

(b) Normalising according to science and principle (ie. not according to the fragments of NFs in the form of "mathematical definitions")

I just Normalised, and ensured that there were no Update Anomalies.

That act, ensured that the database would "satisfy" BCNF, 4NF, 5NF, and any other NF that has not been defined or "mathematically defined" yet. As evidenced.

As far as I am concerned, based on my 36 years in the science, BCNF, 4NF and 5NF, are all simply their 3NF restated (they re-defined 3NF after Codd died).

They need it to "plug holes" because they disagree with the Codd definition of 3NF, and then, that subverted "definition", sure enough, has "holes".

Meanwhile, database experts use the Codd definition for 3NF, in which there are no holes, there is nothing to plug.

Up to 2007, I was producing databases that were 95% DKNF, according to the Codd definition and intent in the Relational Model. That last 5% was prevented, due only to a limitation in SQL. Since 2007, when the last obstacle was removed, I have been producing 100% DKNF databases.

But the "theoreticians" flatly state that it cannot be done. They are relying on their "mathematical poofs". I am relying on Codd's definition.

Even when I wrote to the authors of the DKNF "mathematical definition", and gave them a full data model plus diagrams plus documentation, it turned out, they couldn't understand it.

They understand only (a) "mathematical definitions" that are (b) based on their previous "mathematical definitions".
A proof, in and of itself, that they do not understand Codd or his Relational Model. Let alone follow his model or his intent.

By the evidence that they produce themselves, the post-Codd authors do not understand the Relational Model. Note this, very gravely: therefore they do not know, they do not understand, what the Relational Model delivers: Relational Integrity (as distinct from Referential Integrity); Relational Power (at your level of questioning, this means JOIN power); Relational Speed. They only know RFS non-integrity; RFS navigation; RFS speed.

Update Anomaly

Bottom line:

Normalise by science, by principle, by logic. Most important, by the method given by Codd in the Relational Model.

Not by NF definitions. You can't Normalise via NF definitions anyway, because they give no method.
Note that there are two NFs plainly defined by Codd in the RM, that the authors who write books in the post-Codd era, alleging to be "relational" do not mentioni them, or the method, in their work.

Prevent Update Anomalies by knowledge of what an Update Anomaly is.

Not by the fleeting "mathematical definition", because (a) that will not produce a database that is free from Update Anomalies, (b) they argue among themselves whether the "mathematical definition" "proves" what it proposes to prove, and (c) the "mathematical definitions", proofs, and arguments, change all the time.
That is all evidence of subjective truths (non-permanent, changing), of pseudo-science. Science is based on objective truth. One of the qualities of objective truth is, it is permanent, it does not change.
Note also, Philip's notion of "update anomaly" is an opinion. It has nothing to do with the definition, which was cast decades ago, which is an objective truth. While your data model fails to prevent Update Anomalies (as per the evidence I provided), it may well "satisfy" someone's, anyone's, opinion of chop suey. Meanwhile, regardless of these ever-changing opinions, the data model continues to fail to prevent Update Anomalies.

One more question: do I need to get rid of all surrogate keys?

Yes. If you want a Relational database. If you are happy with a Record Filing system that has none of the Integrity; Power; or Speed that a Relational Database has, you can keep them. And all the manual labour that goes with RFS.

As explained above. They are physical, not logical. They are not Keys, they are pre-relational physical pointers in Record Filing Systems.
The fact that any particular post-Codd author uses Record IDs, proves that that author has no understanding of the most basic, fundamental requirement in the RM. They understand only RFS, therefore that is all they can teach.
The fact they they teach Record IDs at the logical model level proves that they are actively committing a fraud, making a physical ID look like a logical object, making it look like a "key". Anti-relational.

Code

If so - how it will affect foreign keys, JOINs?

What do you mean ? Do you need code ? Well, your model is not progressed enough to code from, and I can't use the IDs that it has now. Let's use this Data Model.

That is an IDEF1X data model. IDEF1X is the Standard for modelling Relational Databases, that we have had since 1987 (as a Standard since 1993). Date, Darwen, Fagin, et al, steer clear of it, they use text only, or invent their own diagrams. Using a Standard would expose their methods as sub-standard.
Please be advised that every little tick; notch; and mark; the crows foot; the solid vs dashed lines; the square vs round corners; means something very specific and important. Refer to the IDEF1X Notation. If you do not understand the Notation, you will not be able to understand or work the model.
Foreign Keys are bold, and use the exact same column name as the parent PF. The exception is when (a) there is more than one FK to the same parent, or (b) for clarity and meaning, where we use RoleNames.
TaxonomyNo might look like a surrogate to you but it isn't. That model is very mature (seventh iteration). Notice how it is used to build an hierarchy of Taxonomy classes. That is the same implementation as the Unix Inode (which is also numeric, and not a surrogate), which is famous for its simplicity and power.

Let's get a report that uses a many-to many relationship such as you are concerned about. List all Species (not the entire Taxonomy tree, just the leaf level) and their Activities. "Activity" is CHAR(X) code, we want the Name:

    SELECT  [Species]  = T.Name,
            [Activity] = A.Name
        FROM Species S
            JOIN Taxonomy T 
                ON S.SpeciesNo = T.TaxonomyNo
            JOIN SpeciesActivity SA
                ON S.SpeciesNo = SA.SpeciesNo
            JOIN Activity A
                ON SA.Activity = A.Activity

And is it ok to realize many-to-many relationships between two entities using composite PRIMARY key that consists of two foreign VARCHAR keys?

Yes. That is normal, pedestrian, ordinary. That is what Relational means, relating by Key. Most Keys in a Relational Database are compound Keys (composite).

SpeciesActivity is an example of such.

Using an ID column on that table would be totally superfluous, because the combination (SpeciesNo, Activity) is the only combination that provides row uniqueness; thus it is the PK.
No amount of "unique" record IDs can improve on that.
And if it were not the declared PK, regardless of the number of "unique" Record IDs used, then duplicate rows would be guaranteed.

This next point is about the physical, not logical, but I won't avoid it. That VARCHAR has to go. Use of VARCHAR everywhere is another mistake.

Use fixed length always, variable length only when absolutely necessary.
The use of VARCHAR in Keys is particularly bad, because that ends up in one or more indices, which means every index entry has to be packed/unpacked, on every access to those entries.
If you do not know what the length should be, that itself is proof that you do not know the data enough to be modelling it. So go and find out about the data, get to know and love it. Not for the removal of VARCHAR, but for the familiarity that is demanded before modelling.

Response to Comments 2

So database model which is in MySQL Workbench can not be called logical model, does it?

Sorry, I have little idea what MySQL does (I have a good idea what what they do not do), and I don't touch them or PCs unless I absolutely have to, so I can't really answer that question.

But I can guess. That question implies that it automatically includes ID fields. See if there is a setting to stop that behaviour. If not, then yes, it can't be used for a logical model, or a physical model for a Relational database. No matter what you call the thing it produces.

If I remove all IDs from the relations then I will get the same relations (as SPACESHIP_FLAT) but with Foreign Key constraints on attributes TYPE and LV as well as two additional relations SSTYPES and LVEHICLES that hold criteria for FK constraints.

Yes.

Stated another way, you could use the 3-table model; remove the ID fields; and replace the numbers in SPACESHIP.TYPE and LV with the strings. Then you have a (a) genuine Relational data model, and (a) a logical model.

But wait, there's more. As part of the normal Modelling exercise, you might notice that carrying wide columns such as TYPE and LV, as FKs in the child, is a bit silly, as well as resulting in larger storage. What is commonly done at this stage, before finalising the logical model, is to use a short CHAR(1) or CHAR(2) code, which meaningfully represent the string.

This is much better than an ENUM or ID.
Note that this assists coders when they are debugging, you can identify the meaning (logical Keys have meaning) from the SPACESHIP rows itself (without a JOIN to the Lookup tables).
```
  SSTYPE
  SS_Code Name
  ------- -------------
  Sz      Soyuz         
  SS      Space Shuttle 
  Or      Orion         
  C1      CST-100      
  D2      Dragon V2
```

SPACESHIP will then have SS_Code and LV_Code as FKs.

And yes - that would be great for me to be able to see the difference between database concept[ual model] and database logic[al model].

Sorry, I can't help you there either. I repeat, in my considered opinion, the concept of a conceptual model is (a) a farce, and (b) completely unclear (every "theoretician" has a different notion of what it is).

I never use separate model for logical vs physical, I simply extend the logical into the physical. Only ERwin provides that functionality, the rest of the CASE tools require separate models, and the migration; synchronisation; replication; and the usual host of horrors associated with duplicating anything.
If the customer demands a conceptual model, I ask them to define it, in terms such that I can quote for it. That usually shuts them up.

Perhaps I should rephrase my question as: Shall I add Foreign Key constraints on attributes?

Most definitely. If you do not have FK Constraints, then you do not have (a) Referential Integrity, or (b) a database, let alone a Relational one.

One more note. You call them "attributes". They are not. They are Keys. Either Primary or Foreign, depending on the subject location. The RM differentiates between Keys and attributes, and they are treated differently. In a good implementation, that treatment is carried all the way through to the physical, using private Datatypes (Domains in the Logical), etc.

The post-Codd theoreticians do not have Keys. They have only physical ID fields which they call "keys", and thus everything else (actual Primary Key fields, as well as Foreign Key fields, as well as non-key fields) is an attribute. Which does not have the value of a Relational Key. All of which is non-relational.
They have "candidate keys" which is a refusal to accept the prescriptions in the Relational Model. It is not simply a difference in terminology.
Further, the label "candidate" is a joke, because one of them has to be chosen as Primary, early in the logical stage (as you have seen), and following the election, the losers are no longer "candidates", they are losers (they are Alternate Keys in the RM). This is done in this way, to conceal the fact that (a) they have no Relational Keys (logical), and (b) they are using a Record ID (physical pointer) as "primary key". And therefore, as a consequence, oh, oh, the some attributes are "candidates".

Which brings me back to your initial question:

Is it reasonable to separate relation attributes with frequently duplicate string values?

The exercise of Normalisation is primarily the removal of duplicate values, and deployment to separate tables.

It is not only reasonable, it is demanded. (Which is why I stated at the outset, that the thing was not Normalised.)

And now you know that they are not attributes, they are Relational Keys.

This simple model doesn't have it, but most Relational tables will have compound (composite) Keys.