django orm polymorphism relational-database normalization

Normalized, denormalized, or what? A table to store responses in a Survey database

I'm framing this intentionally as a general question, even though I'm using Django's ORM to implement these tables. Here are my Django models if you want to have a peek.

The Riddle

I have several tables with the following relationships:

Survey <--m2o-- Page <--m2o-- Category <--m2o-- Question <--m2o-- Choice <--?-- Response
|               |             |                 |                 |
|- name         |-number      |- name           |- sortid         |- sortid
                              |- sortid         |- text           |- text
                              |- text           |- short
                                                |- qtype

Where m2o is short-hand for many-to-one, each table has an implicit primary key, and my question is with regard to how to organize the response table.

The Response table should have some kind of relation out to a User table (depending on the table definition, this might be one-to-one or many-to-one). The purpose of the table is to store user responses to particular questions. The problem I'm having is that depending on question type, I would need to have the response stored in a different column type in the Response table. For example a table might resemble:

| user | (question) | (qtype)         | choice    | response                      |
|------+------------+-----------------+-----------+-------------------------------|
| 1    | Q1         | Select One      | A         | False                         |
| 1    | ''         | ''              | B         | False                         |
| 1    | ''         | ''              | C         | True                          |
| 1    | ..         | ''              | D         | False                         |
| 1    | Q2         | Select Multiple | 1         | True                          |
| ''   | ''         | ''              | 2         | True                          |
| ''   | ''         | ''              | 3         | False                         |
| ''   | ''         | ''              | 4         | True                          |
| ''   | Q3         | Long Text       | NULL      | "It was the best of times..." |
| ''   | Q4         | Select Explain  | A         | False                         |
| ''   | ''         | ''              | B         | False                         |
| ''   | ''         | ''              | C         | False                         |
| ''   | ''         | ''              | D (other) | True                          |
| ''   | ''         | ''              | Explain   | "I actually prefer bananas."  |

I've added columns for clarity that would be implicitly defined by relationships, and therefore not present in the actual table. They're denoted with ( ). Also, choice would be actually be a foreign key.

It would appear that the response column might need to be a text column. Breaking it out into two or more columns like response-text and response-boolean feels messy. We could easily have hundreds of rows for a single user.

I've also thought about organizing the table like this:

| user | survey | response (key-value store)                 |
|------+--------+--------------------------------------------|
|    1 |      1 | {"q1": "C", "q2": "1,2,4"                  |
|      |        | "q3": "It was the best of times..."        |
|      |        | "q4": "Other: I actually prefer bananas."} |
|    2 |      1 | {...}                                      |
|    3 |      1 | {...}                                      |

How would you define this table, and why?

Bonus points if you explain in what contexts your definition would not work well.

If you'd like to imagine my situation, here are some notes on my context:

The response table might be updated a few times during the survey, but never afterward. This particular table is OLAP, or read-mostly. The only queries I'll be doing are to export the table for statistical analysis in a commercial software package.
Responses from different users will rarely be identical (from a high level)
Performance isn't an issue in the sense that only 15 users will be simultaneously connected to the application during a survey administration session. Only one user (me) will be looking at the data after it is stored.

I suspect my situation is close to the example of N1NF on Wikipedia.

Should this table be polymorphic?

I've just heard of multiple-table inheritance, or polymorphism, and I'm not sure how to apply it to this problem. Specifically, I'm not sure which tables would benefit from inheritance. For example, should Response extend Choice? Should Page extend Survey?

Solution

I'm the lead architect & developer of http://360test.nl/ which uses surveys to measure team performance etc.

Our basic table structure is not that different from yours.

Answered are stored in a rather simple table:

ID | Participation | Question | RawValue

Participation is the id of the participation of one user in one survey. Question is the id of one question in the survey. RawValue stores one single answer. It's format depends on the question type. It can be a single number (for a simple "pick one" question) to a more complex json construct (e.g. for questions of the type "distribute X points over those Y choices). The question types know how to serialize/deserialize the data to something usable.

We don't use inheritance in our tables. It would make sense if we needed online queries that investigate the answers ("how many users picked answer 3 in survey xyz"), but we don't need that. Thus, we can use our simple "generic" RawValue column. When we show the results of a survey, all answers are loaded, their values are deserialized and evaluated according to the question type. If we need more special reports, we generate them offline and store the results somewhere.

I think that even your suggested approach (store all answers to one participation in one column) would work for us, as we load the whole data when performing the evaluation anyway.

I'm not saying that our approach is the best, but it works for us.

As a general note - normalization is an important topic. But that doesn't mean, everything needs to be normalized. While I leave the persitent representation of the core domain model normalized, I sometimes store severely denormalized data in the database for reports generated from the domain model. It's just important that you know that it is denormalized and where the data came from.

I have written software using an I think the CQRS architecture and one thing I learned is that denormalization can be a very nice thing. Some CQRS gurus suggest one table per web page which contains all (denormalized) data necessary for that page. While this isn't necessarily what you should do all the time, it shows that text book normalization isn't a requirement in all cases.