Search code examples
powerbianalytics

Power BI Survey data schema


I'm trying to determine the best data model to transform survey data into for Power BI.

I'm pretty sure a star schema is what I'm after, but my survey data doesn't have the same responses for all questions (i.e. it's not rate this from 1-5 for every question) - each of the 20+ questions has it's own pre-defined list of answers.

Most of the questions are along the lines of: "what's your favourite ice cream" and the answer can be one of vanilla, pistachio, strawberry etc.

A couple of the questions are in this kind of format: Rate each of the following flavours of ice cream from one to 5: Pistachio (1-5) Vanilla (1-5) Strawberry (1-5) etc. So they're more like a kind of question group.

There's also a couple of free text answer fields they want turning into word clouds, like: q:"What are your fears for the future" a: "not being able to find pistachio ice cream in a post apocalyptic society"

There are roughly 7000 respondents.

My data is structured with one row per respondent with a bunch of demographic data columns in it (age, ethnicity, annual salary etc.), and about 200+ columns of answer data like this:

ResponseID Age Range Religion Q1 Vanilla Q1 Pistachio Q1 Strawberry Q2 Free Text Q3 Rate Vanilla Q3 Rate Strawberry Q3 Rate Pistachio
1 25-30 Jedi 1 0 0 blah blah 5 4 1
2 35-40 None 0 1 0 6 3 2

So for the example questions I gave above, Q1 would be "What's your favourite ice cream?" selecting an answer gives a 1 or 0 in the appropriate column.

Q2 is a free text answer field

Q3 would have an overarching question like: Rate the following ice cream flavours from 1-5, and each 'flavour' gets a value corresponding to that answer (rather then the 1 / 0 from Q1)

So what's my problem?

I'm OK using power query to get the source data into whatever format I need it to be, but I don't know what that format should be, because I've got potentially 200+ different answers to 20+ questions.

I'm leaning towards one fact table with question ids and answer ids that then reference answer and question dimension tables, but then will I have problems filtering on answers for questions???

Or do I make it a kind of snowflake where the answers are grouped by a question table, so I only get answers for the appropriate questions?

Or do I flatten a "question & answer" table together with a combined key that I can reference back to the fact table?

Any thoughts appreciated.

Cheers

Yonabout


Solution

  • Survey answer models are pretty common scenarios and you absolutely can model it as a star schema. A question dimension and an answer dimension gives quite a lot of flexibility. Some answers can optionally have a ‘score’.

    You’d be able to filter by answer easily by choosing an answer from the answer dimension.

    Since your answers only apply to a question, you absolutely can have questions and answers in the same dimension.

    The important thing is to decide the grain of the fact. The fact would be most flexible if it is one row per answer within the survey, not as you have it now with one row per person’s set of answers.