I'm looking for a little help working out a dimensional model. What I have boils down to web event analysis: given web logs, I want to parse and store the variables present in the URL. The trick is that those variables are not always predefined, and on occasion a variable may contain more than one value.
Let's take a look at a hypothetical. Say I have a query string like:
session_id=SID&key1=value1&key2=value2&key3=value3a&key3=value3b&key3=value3c
My goal is to be able to figure out aggregates by any arbitrary combination of those keys. For instance, I might say "how many page hits had a key3 value of value3a," or "how many page hits had a key1 value of value1 and a key3 value of value3b". To add to the complexity, it's possible that eventually a key4 and key5 and so on will appear, and potentially without enough forewarning to be able to make dimensional model changes before the values appear.
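For concreteness, the parsing step might look roughly like this. It is only a sketch using Python's standard `urllib.parse.parse_qs`, which keeps repeated keys as lists; the variable names are illustrative, and pulling `session_id` out separately is an assumption about how the session would be handled.

```python
from urllib.parse import parse_qs

query = "session_id=SID&key1=value1&key2=value2&key3=value3a&key3=value3b&key3=value3c"

# parse_qs maps each key to a list of values, so repeated keys are preserved
params = parse_qs(query)

# Assumption: session_id is handled separately from the analysis keys
session_id = params.pop("session_id")[0]

# Flatten what is left into (key, value) pairs, one per value seen
pairs = [(key, value) for key, values in params.items() for value in values]

print(session_id)
print(pairs)
```

Keys that were never seen before (key4, key5, ...) simply show up as additional pairs, so the parsing side needs no advance notice; the open question is the model that receives them.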
One approach might be to create 3 dimension tables, dim_key1, dim_key2, and dim_key3, each with an id field and a value field.
Then my fact table might look something like
id, session_id, dim_key1, dim_key2, dim_key3, count
The downside with this is that I would then need to create 3 rows in my fact table in order to properly capture the 3 values of key3 from the query string. Additionally, for every dimension that appears, I'd need advance notice, and would need to create my new dimension table.
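To illustrate the row multiplication, here is a minimal sketch using an in-memory SQLite database as a stand-in for the warehouse. The table name `fact_hit` and the `hit_id` column are hypothetical additions (the fact table above has no page-hit identifier), and the `count` column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_key3 (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE fact_hit (
    id INTEGER PRIMARY KEY,
    hit_id INTEGER,          -- hypothetical page-hit identifier
    session_id TEXT,
    dim_key1 INTEGER,
    dim_key2 INTEGER,
    dim_key3 INTEGER
);
INSERT INTO dim_key3 VALUES (1, 'value3a'), (2, 'value3b'), (3, 'value3c');
-- One page hit becomes three fact rows, one per key3 value
INSERT INTO fact_hit (hit_id, session_id, dim_key1, dim_key2, dim_key3) VALUES
    (100, 'SID', 1, 1, 1),
    (100, 'SID', 1, 1, 2),
    (100, 'SID', 1, 1, 3);
""")

# Counting rows overstates hits for any filter that does not pin down key3
naive_hits = conn.execute(
    "SELECT COUNT(*) FROM fact_hit WHERE dim_key1 = 1"
).fetchone()[0]

# Counting distinct page hits gives the intended answer
distinct_hits = conn.execute(
    "SELECT COUNT(DISTINCT hit_id) FROM fact_hit WHERE dim_key1 = 1"
).fetchone()[0]

print(naive_hits, distinct_hits)
```

So besides the extra rows, every aggregate in this design would have to count distinct hits rather than rows.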
Another possible approach, more geared for dim_key3, might be to create a dimension table like
id, value3a, value3b, value3c, ...
where rows in that table are made of ones and zeroes representing those combinations of values. So for instance, that query string above would have a row that looks something like 1, 1, 1, 1, 0, 0, ..., and the page hit fact table would have a dim_key3 dimension id of 1.
On the plus side, each page hit would only have one entry in the fact table, and the dimension table can keep a sparse representation, where we only create new rows in it for combinations we've actually seen (i.e. we don't need the power set of all key3 combinations). On the downside, every new value for key3 would still need a new column added to that dim table.
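A rough in-memory sketch of that lookup follows. The list `columns` stands in for the flag columns of the dimension table and the dict `dim_key3` for its rows; the function name `lookup_combo` is hypothetical. In a real database the "new column" branch would be a schema change rather than a list append.

```python
columns = ["value3a", "value3b", "value3c"]  # one flag column per known key3 value
dim_key3 = {}  # flag tuple -> dimension id; only combinations actually seen

def lookup_combo(values):
    """Return the dimension id for a set of key3 values, creating it if needed."""
    for v in values:
        if v not in columns:
            # New value: the equivalent of adding a column, with existing rows set to 0
            columns.append(v)
            padded = {flags + (0,): dim_id for flags, dim_id in dim_key3.items()}
            dim_key3.clear()
            dim_key3.update(padded)
    flags = tuple(1 if c in values else 0 for c in columns)
    # Sparse: a row is only created the first time a combination is seen
    return dim_key3.setdefault(flags, len(dim_key3) + 1)

first = lookup_combo({"value3a", "value3b", "value3c"})
second = lookup_combo({"value3a"})
repeat = lookup_combo({"value3a", "value3b", "value3c"})
with_new_value = lookup_combo({"value3a", "value3d"})  # forces a fourth column
print(first, second, repeat, with_new_value, columns)
```

The repeated combination reuses its existing id, while the previously unseen value3d widens every existing row.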
Last idea, for dim_key3, would be to have a table like id, value_list, where value_list stores the comma-separated list of values seen. It's a similar approach to the "column-for-every-value" approach, but just keeps a more compact representation. In this instance, we might have a dimension row like 1, "value3a,value3b,value3c".
Similarly, this would only require one row in the fact table, and as an added advantage, wouldn't need new columns as new values show up. The downside would be that it forces some complexity into the query, in having to do full-text matches/regexes. (I can get into more of that if there's interest, but I feel like I've already carried on long enough.)
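As a hedged illustration of that query complexity, here is a sketch against an in-memory SQLite database. The table name `fact_hit`, the helper `hits_with_value`, and the second dimension row (`value3ab`) are made up for the example, and it assumes values never contain commas or the LIKE wildcards `%` and `_`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_key3 (id INTEGER PRIMARY KEY, value_list TEXT);
CREATE TABLE fact_hit (id INTEGER PRIMARY KEY, session_id TEXT, dim_key3 INTEGER);
INSERT INTO dim_key3 VALUES (1, 'value3a,value3b,value3c'), (2, 'value3ab');
INSERT INTO fact_hit (session_id, dim_key3) VALUES ('SID', 1), ('SID2', 2);
""")

def hits_with_value(value):
    """Count page hits whose key3 list contains exactly this value."""
    # Wrap both sides in commas so 'value3a' cannot match inside 'value3ab'
    sql = """
        SELECT COUNT(*)
        FROM fact_hit f
        JOIN dim_key3 d ON d.id = f.dim_key3
        WHERE ',' || d.value_list || ',' LIKE '%,' || ? || ',%'
    """
    return conn.execute(sql, (value,)).fetchone()[0]

print(hits_with_value("value3a"), hits_with_value("value3ab"), hits_with_value("value3"))
```

Every filter on key3 needs this kind of delimiter-aware pattern, and a leading-wildcard match like this generally cannot use an ordinary index.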
I've looked at several references, including Kimball's "The Data Warehouse Toolkit," and haven't found anything that directly answers my question. Most of the click/web/eventstream analysis dimensional model examples have a fixed set of unchanging and singularly-valued variables.
Are any of my three outlined approaches reasonable, and/or does anyone have any suggestions on another model that I've missed?
Thanks in advance!
What about a design where the fact table includes a row for each association between a page load and a key/value pair:
RowID, SessionID, PageLoadID, KeyValuePairID, DateKey, etc.
The KeyValuePair dimension table would have:
KeyValuePairID, Key, Value
So for your example, you would create dimension rows
1 key1 value1
2 key2 value2
3 key3 value3a
4 key3 value3b
5 key3 value3c
Then you can run Intersect queries to find your answers:
Select Intersect(
    NonEmpty([Session].Members, [KeyValuePair].&[Key1]&[Value1]),
    NonEmpty([Session].Members, [KeyValuePair].&[Key3]&[Value3a])) on 0
From myCube
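If the data sits in a relational store rather than a cube, the same intersection idea can be sketched in SQL. The example below uses an in-memory SQLite database; the table names `KeyValuePair` and `PageLoadFact`, the helper `count_page_loads`, and the second page load are assumptions for illustration. It counts page loads rather than sessions, since that is what the question asks about, and it expects at least one pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KeyValuePair (KeyValuePairID INTEGER PRIMARY KEY, "Key" TEXT, "Value" TEXT);
CREATE TABLE PageLoadFact (
    RowID INTEGER PRIMARY KEY,
    SessionID TEXT,
    PageLoadID INTEGER,
    KeyValuePairID INTEGER
);
INSERT INTO KeyValuePair VALUES
    (1, 'key1', 'value1'), (2, 'key2', 'value2'),
    (3, 'key3', 'value3a'), (4, 'key3', 'value3b'), (5, 'key3', 'value3c');
-- Page load 1 is the example query string; page load 2 is made up (key1=value1 only)
INSERT INTO PageLoadFact (SessionID, PageLoadID, KeyValuePairID) VALUES
    ('SID', 1, 1), ('SID', 1, 2), ('SID', 1, 3), ('SID', 1, 4), ('SID', 1, 5),
    ('SID2', 2, 1);
""")

def count_page_loads(pairs):
    """Count page loads that carry every (key, value) pair in `pairs`."""
    one_pair = """
        SELECT DISTINCT f.PageLoadID
        FROM PageLoadFact f
        JOIN KeyValuePair k ON k.KeyValuePairID = f.KeyValuePairID
        WHERE k."Key" = ? AND k."Value" = ?
    """
    # One subquery per requested pair, intersected on PageLoadID
    sql = "SELECT COUNT(*) FROM (" + " INTERSECT ".join([one_pair] * len(pairs)) + ")"
    params = [item for pair in pairs for item in pair]
    return conn.execute(sql, params).fetchone()[0]

print(count_page_loads([("key3", "value3a")]))
print(count_page_loads([("key1", "value1"), ("key3", "value3b")]))
```

New keys such as key4 or key5 only add rows to the KeyValuePair dimension, so no schema change is needed when they appear.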