Tags: shiny, shiny-server

shiny server: what is the best practice to update data on the server


I have a shiny app that is loading data from some files. On a server, what is the best way to update those files without interrupting the server?

Searching the internet, I found these two solutions:

1) Use reactivePoll() or reactiveFileReader()

http://shiny.rstudio.com/gallery/reactive-poll-and-file-reader.html

2) use reactiveValues()

Update a data frame in shiny server.R without restarting the App

values <- reactiveValues()
updateData <- function() {
  vars <- load(file = "my_data_frame.RData", envir = .GlobalEnv)
  for (var in vars)
    values[[var]] <- get(var, .GlobalEnv)
}
updateData()  # also call updateData() whenever you want to reload the data

output$foo <- renderPlot({
  # Assuming the .RData file contains a variable named mydata
  plot(values$mydata)
})

What is the best practice to reload files that are loaded in shiny?

Thank you for any input!


Solution

  • Let me try to reframe your question and put the examples / sample code you are referring to into context.

    At a very high level (i.e. without worrying too much about reactivity), R + shiny doesn't differ from any standard way of treating data as part of an ETL process, for example.

    That is, you can load the following types of external data into the shiny server:

    1. Data at rest, i.e. data residing in a file on the filesystem, or retrieved by executing an RDBMS query. This is the standard case that covers most usage.
    2. Data in motion. This typically refers to a stream of data of some type that you are trying to analyse (i.e. without persisting it to a file or an RDBMS table).

    Let's first talk about the different varieties of the first case, data at rest:

    server <- function(input, output, session) {
      # ...
      output$foo <- renderPlot({
        someQuery <- dbGetQuery(...)  # some query of a database
        plot(someQuery)
      })
      # ...
    }
    

    The code above will run the query every time the reactive function is executed.

    And this is where reactivity can help a great deal: for example, with no other changes, the code above will be executed once for each user connecting to the application.

    If the underlying data is updated frequently by an external process, the results for different users may differ.

    Moreover, anything that causes the reactive construct to be re-executed will re-run the query as well (for example, simply refreshing the browser re-executes the query, as each browser refresh generates a different session).
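    One practical way to control how often such a query runs is to choose where it lives in the app. A minimal sketch, assuming an existing DBI connection `con` and a hypothetical table `my_table` (both are illustrative names, not from the original post):

    ```r
    library(shiny)
    library(DBI)

    # Runs ONCE per R process, shared by all sessions of the app:
    sharedData <- dbGetQuery(con, "SELECT * FROM my_table")  # assumed connection/table

    server <- function(input, output, session) {
      # Runs once per session (i.e. once per browser tab / refresh):
      sessionData <- dbGetQuery(con, "SELECT * FROM my_table")

      output$foo <- renderPlot({
        # Anything placed here runs on every reactive invalidation instead
        plot(sessionData)
      })
    }
    ```

    Moving the query up one scope level trades freshness for fewer executions: process-level data is loaded once but goes stale until the app restarts, while session-level data is re-read on every refresh.
    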

    As you will know from any shiny training, the next step could be to link the above reactive construct to some other UI element, for example an action button or a selectInput to filter the data.

    server <- function(input, output, session) {
      # ...
      output$foo <- renderPlot({
        # the reactive is now connected to these two shiny inputs
        # and re-executed every time either of them changes
        if (length(input$actionbutton) == 0 || length(input$selectData) == 0) return()

        someQuery <- dbGetQuery(...)  # some query of a database, maybe with a
                                      # *where* clause dependent on input$selectData
        plot(someQuery)
      })
      # ...
    }
    

    Now the query will be executed every time the action button is pressed or a new selection is made.

    Let's suppose that for your use case, as I'm sure you have either seen or implemented in ETL, your data changes often. Suppose the file (or table) is continuously updated by an external process.

    Please note that this use case is usually still considered at rest, even if updated frequently (you are processing the data in batches or, if the interval is really small, mini-batches).

    This is where your first example, with the reactiveFileReader and reactivePoll constructs, comes into play.

    If you have a file, for example a log file, updated very frequently by an external process, you can use reactiveFileReader.
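    As a sketch (the file name `app.log` is an assumption for illustration), reactiveFileReader checks the file's modification time every intervalMillis and re-reads it only when the file has actually changed on disk:

    ```r
    library(shiny)

    server <- function(input, output, session) {
      # Check "app.log" every 1000 ms; readLines is invoked only
      # when the file's modification time has changed
      logData <- reactiveFileReader(
        intervalMillis = 1000,
        session        = session,
        filePath       = "app.log",   # assumed file name
        readFunc       = readLines
      )

      output$log <- renderText({
        paste(logData(), collapse = "\n")  # dependents refresh automatically
      })
    }
    ```
    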

    If you have a database table, you can, for example, poll it every x seconds with reactivePoll.
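    A sketch of polling a table every 10 seconds (the connection `con`, table `my_table` and column `updated_at` are assumptions): checkFunc should be a cheap probe such as a max timestamp or row count, and the expensive valueFunc runs only when the check value changes:

    ```r
    library(shiny)
    library(DBI)

    server <- function(input, output, session) {
      tableData <- reactivePoll(
        intervalMillis = 10000,   # poll every 10 seconds
        session        = session,
        # cheap check: the full query below is re-run only when this value changes
        checkFunc = function() {
          dbGetQuery(con, "SELECT MAX(updated_at) FROM my_table")  # assumed schema
        },
        # expensive read, executed only when checkFunc's result changes
        valueFunc = function() {
          dbGetQuery(con, "SELECT * FROM my_table")
        }
      )

      output$foo <- renderPlot({
        plot(tableData())   # re-renders whenever the polled data changes
      })
    }
    ```
    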

    Here your code enjoys the full benefit of reactivity: automagically, the code will be executed every x seconds for you, and all the rest of your reactive code that depends on it will be refreshed as well.

    Now, let's assume you try to decrease the *batch size* (i.e. the window) over which shiny checks for new data. How far can you go?

    If I remember correctly from a discussion with Joe Cheng a while back, he was confident that shiny could handle up to 50,000 events per second (imagine polling your database or reading your file that many times per second).

    Assuming I remember this correctly, I would still consider 50,000 events a theoretical limit (you would have to discount the time taken to query your data in an RDBMS, possibly over a LAN, etc.), so for file access I would use an interval > 1 millisecond (i.e. < 1,000 file reads per second), and a much bigger time interval for an RDBMS.

    It therefore shouldn't really be surprising that the unit of time for the above functions is the millisecond.

    I think that with the above constructs it is possible to implement very ambitious micro-batch pipelines using R + shiny.

    It would even be possible to imagine using Apache Kafka to publish data to R + shiny (maybe serving Kafka through multiple instances of Shiny Server Pro with load balancing: yummy!)

    So, what about data in motion?

    Well, if you get data from a firehose at a rate manageable by R and shiny, you'd be OK (you may have trouble identifying which R algorithms to use for this streaming use case, but that would deserve another question).

    On the other hand, if your process requires really low latency, well beyond what is described above, you possibly need to consider other types of tools and pipelines (e.g. using Apache Flink, or ad hoc code).

    Apologies for the very wordy explanation. Please let me know if it makes this complex topic any clearer.