Search code examples
rreshape2

How would you eliminate a sequence of events after x events had occured in this data table?


I have a data.frame that gives the time for an event to occur for each x and y co-ordinate across different simulations. I've attached the head of this table below using dput().

head_data<-structure(list(x = c(987.353265152362, 570.817987386894, 1147.5681499552, 
637.526076016409, 1439.13510253106, 1396.6452808061), y = c(1802.08232812874, 
349.336242713164, 1789.49467712533, 361.611973188148, 1492.44148360367, 
1459.91771610835), id = 1:6, `simulation 1` = c(1100, 600, 1200, 
400, 900, 1000), `simulation 2` = c(1500, 1400, 1600, 1200, 1200, 
1300), `simulation 3` = c(1200, 1100, 1200, 1000, 900, 900), 
    `simulation 4` = c(1300, 800, 1200, 900, 1100, 1100), `simulation 5` = c(1500, 
    1200, 1400, 1100, 1300, 1200), `simulation 6` = c(200, 1400, 
    100, 1100, 600, 600)), row.names = c(NA, 6L), class = "data.frame")

First melting the data into a long format

data_long <- melt(head_data, id.vars = c('x', 'y', 'id'), value.name = 'time', variable.name = 'sim')

I then sort the time of events

times <- sort(unique(data_long$time))

Now I translate this data.frame of events into total prevalence by summing up the events at every time interval for each sim.

data_clust_10 <- data_long %>% group_by(sim) %>%
  do(data.frame(time=times, infected=sapply(times, function(x) sum(.$time <= x))))

I then filter the prevalence data by removing all events after a threshold number, there are 1000 x and y co-ordinates in the full data, but we are working with only 6 individuals per sim, so let's say 2 events.

data_clust_10_cut<-filter(data_clust_10, infected < 2)

Is it possible to translate this data frame back into the original format of head_data? Can I use dcast()? I think that the number of rows will be different so it will not work or am I wrong? I want to do this because I am going to use the refined data to estimate the coefficient of variance. Thinking this through I actually think that arranging the times in the original table, head_data, and then eliminating the number of events that are above the count of my prevalence threshold may be the best solution, but I would be interested to learn if dcast() was possible to use in this scenario. I want the sequence of events in the original table to be ordered according to the time it took for the event to occur, and then I want to eliminate all events that occurred after x of total events occurred, regardless of time.


Solution

  • Is this what you're looking for:

    library(tidyr)
    data_clust_10_cut %>% pivot_wider(names_from="sim", values_from="infected")
    # # A tibble: 8 x 7
    #     time `simulation 1` `simulation 2` `simulation 3` `simulation 4`
    #    <dbl>          <int>          <int>          <int>          <int>
    # 1   100              0              0              0              0
    # 2   200              0              0              0              0
    # 3   400              1              0              0              0
    # 4   600             NA              0              0              0
    # 5   800             NA              0              0              1
    # 6   900             NA              0             NA             NA
    # 7  1000             NA              0             NA             NA
    # 8  1100             NA              0             NA             NA
    # # … with 2 more variables: `simulation 5` <int>, `simulation 6` <int>
    #