Search code examples
rdata.tableolapolap-cubedata.cube

Is it possible to set a cube with a snowflake schema in data.cube R?


Or, can dimensions be, in some way nested, in data.cube?

Given the following example (accessed via ?data.cube on R, having installed last branch of data.cube-oop package, by @jangorecki) for which I post code and image example.

Consider I want to expand the cube adding a new dimension which would turn the schema to snowflake, so for each geography location, I would have another set of data (data.table) which would describe demography properties (i.e. population based on gender, age, etc)

Image

Snowflake Schema example image

dotted: possible new dimensions.

black: actual facts and dimensions from code example.

green: new dimension which turns the schema into snowflake.

Code

# multidimensional hierarchical data from fact and dimensions
X = populate_star(N = 1e3)
sales = X$fact$sales
time = X$dims$time
geography = X$dims$geography
# define hierarchies
time.hierarchies = list( # 2 hierarchies in time dimension
    "monthly" = list(
        "time_year" = character(),
        "time_quarter" = c("time_quarter_name"),
        "time_month" = c("time_month_name"),
        "time_date" = c("time_month","time_quarter","time_year")
    ),
    "weekly" = list(
        "time_year" = character(),
        "time_week" = character(),
        "time_date" = c("time_week","time_year")
    )
)
geog.hierarchies = list( # 1 hierarchy in geography dimension
    list(
        "geog_region_name" = character(),
        "geog_division_name" = c("geog_region_name"),
        "geog_abb" = c("geog_name","geog_division_name","geog_region_name")
    )
)
# create dimensions
dims = list(
    time = as.dimension(x = time,
                        id.vars = "time_date",
                        hierarchies = time.hierarchies),
    geography = as.dimension(x = geography,
                             id.vars = "geog_abb",
                             hierarchies = geog.hierarchies)
)
# create fact
ff = as.fact(
    x = sales,
    id.vars = c("geog_abb","time_date"),
    measure.vars = c("amount","value"),
    fun.aggregate = sum,
    na.rm = TRUE
)
# create data.cube
dc = as.data.cube(ff, dims)
str(dc)

other questions related to the example are:

what is the value expected for each element? why "time_week" = character() "time_date" = c("time_week","time_year") instead of "time_week" = character() "time_date" = date() and why this naming as in columns of data.table? "time_quarter" = c("time_quarter_name"), "time_month" = c("time_month_name")


Solution

  • Cube model is underlying structure that user don't have to deal with. data.cube-oop uses the following data model.
    Going precisely to your example in question. You can't add new dimensions in snowflake schema this way. New dimensions must be connected to fact table. Tables in snowflake schema that are not directly connected to fact table are just hierarchy levels in dimensions. In your example that means the customer dimension is just a higher level attributes of geography dimensions. You might eventually do the opposite, create customer dimension and on higher levels of customer hierarchy keep geography attributes.
    In any way you decide to do it, you must create your dimensions from single table each (can be same wide table of course). You cannot construct dimension by providing its levels separately, this would be more confusing than just single denormalized dimension table which is something that users most often deal with. So in order to keep geography attributes in customer dimension just lookup geography values to your customer table and supply as.dimension with new table.


    As for the second part of your question, hierarchy lists defines relationship between attributes in hierarchy, not data types. Column name on LHS define a key column in hierarchy level, while RHS define dependent attributes which will be present on that particular level. You basically define which column goes on which level in hierarchy, lower levels must refer to upper level in order to create real hierarchy. This is enforced by uniqueness of data, i.e. you must have only single time_month_name for each time_month.
    To see this relationship better try the following:

    library(data.cube)
    X = populate_star(N = 1e3)
    dc = as.data.cube(X)
    dc$dimensions$time$levels
    

    It will print all hierarchy levels in time dimension, each hierarchy level is a separate table.