Search code examples
databasedatabase-designrelational-databaseentity-relationship

Removing entities form a relational database and keep integrity


At first, this question might seem to be trivial, but...

  • Suppose your app manages a relational database as persistency layer.
  • Suppose (for example), the app manages Students, Courses, Teachers, Grades, Rooms.
  • All of those entity types related somehow to each other.

Question is, how one deletes entities from the persistence layer, but keeps relational integrity and logical consistency.

To be more specific: Regulartory requirements regarding privacy may force you to remove a Student A, who attended course B, from the system - after B finished. Obervation:

  • If one would remove Student A, all relations need to be deleted too, to keep the persistence layer integrity. Unfortunately, the number of student that attended the course reduces by one too. In fact, one want's to keep the number of attendants of B constant.
  • If one anonymises Student A by setting first name and name to @anonymous@, an UNIQE(first name, name) conditions gets broken (a bit artificial, but you get the point).

Therefore, I'm in search for the theoretically / practically right solution to remove entities and keep the logical information persisted in the database.

Note: There are quite some big applications, which never delete records. As far as I know, SAP is one of these applications.


Solution

  • One way would be to simply overwrite privacy-sensitive fields with NULLs, without actually deleting the row. If a sensitive field needs to be unique, put it into NULL-able UNIQUE constraint and instead use a surrogate key as target for foreign keys. A data in a surrogate key, by definition, isn't meaningful, therefore isn't sensitive, therefore doesn't need to be NULL-ified, therefore existing links survive.

    In your example, most of the student's fields would be NULL-ified, but he/she would still remain linked to the course.


    Alternatively, have one special "anonymous" row and just relink to it when deleting "real" rows. This may or may not be plausible depending on the required semantics of the links.

    In your example, a course would relink to the special anonymous student when the actual student is deleted. This way, a course still retains the proper number of students, but you now have to implement the ability to link the same course to the same (anonymous) student multiple times, because multiple real students of the same course may be deleted.


    Another alternative would be "special" case-by-case handling, for example by storing a separate counter or other "meta information".

    In your example, if student count (per course) is important even after students are deleted - simply have a field containing student count (in the course table). This field will no longer match the number of actual students once you start deleting students, but will retain the count of the past attendance. You'll have to carefully update this count to maintain data consistency - using triggers and possibly locking is probably the best option. Also, you'll have to figure out what happens if a student drops out from the course - is the number before or after the drop-out the correct one?