Search code examples
rdata-visualizationtableau-apilevel-of-detail

TABLEAU: How can I measure similarity of sets of dimensions across dates?


this is a bit of a complicated one - but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video on demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10) - the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.

One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be 30%.

Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)

However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!

It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?

Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.

Thanks in advance :)


Solution

  • As Alex has suggested, you can have best of both the worlds. But to the best of my knowledge, Tableau Desktop allows interface with R (or python etc.) through calculated fields i.e. script_int script_real etc. All of these can be used in tableau through calculated fields. Presently these functions in tableau allows creation on calculated field through Table calculations which in tableau work only in context. We cannot hard code these values (fields/columns) and thus. we are not at liberty to use these independent on context. Moreover, table calculations in tableau can neither be further aggregated and nor be mixed with LOD expressions. Thus, in your use case, (again to the best of my knowledge) you can build a parameter dependent view in tableau, after hard-coding values through any programming language of your choice. I therefore, suggest that prior to importing data in tableau a new column can be created in your dataset by running following (or alternate as per choice programming language)

    movies_edited <- movies %>% group_by(Title) %>%
      mutate(similarity = ifelse(lag(date)== date - lubridate::days(1), 1, 0)) %>%
      ungroup()
    
    write.csv(movies_edited, "movies_edited.csv")
    

    This created a new column named similarity in dataset wherein 1 denotes that it was available on previous day, 0 denotes it was not not screened on immediately previous day and NA means it is first day of its screening.

    I have imported this dataset in tableau and created a parameter dependent view, as you desired.

    enter image description here