
Probability of an event occurring at one time of day vs another in R


I have a set of events that occur in the morning and in the afternoon, and would like to calculate the probability of each occurring in the morning vs the afternoon.

That is:

P = number of favourable outcomes / total number of possible outcomes

For example, using the table below, the probability of an Aggression event occurring in the morning vs the afternoon would be:

p morning = 3468/(3468+4658) = 0.4268

p afternoon = 1 - p morning = 1 - 0.4268 = 0.5732

Event_Time  Event_Name  Num_of_Occurances
Morning     Aggression  3468
Afternoon   Aggression  4658
Morning     SIB         900
Afternoon   SIB         1500
Morning     Elopement   400
Afternoon   Elopement   234
Morning     Pica        786
Afternoon   Pica        1234
Morning     Stereotypy  234
Afternoon   Stereotypy  633
Morning     Disruptive  534
Afternoon   Disruptive  780

I'm trying to find the best way to do this in R. I know I could pivot the table wide and add a column with the calculation, but I'm wondering whether prop.table or another function can handle this more efficiently.


Solution

  • You can create a small function to make the calculations, and apply it by group:

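    library(data.table)
    
    # e: counts for one event, t: the matching time-of-day labels
    # (the \(x) lambda shorthand needs R >= 4.1)
    f <- \(e,t) {
      pm <- e[t=="Morning"]/sum(e)   # morning share of this event's total
      list(p_morning=pm, p_afternoon=1-pm)
    }
    
    # dt is the data frame shown under "Input" below; f is applied per Event_Name
    setDT(dt)[, f(Num_of_Occurances,Event_Time), by = Event_Name ]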
    

    Output:

       Event_Name p_morning p_afternoon
    1: Aggression 0.4267782   0.5732218
    2:        SIB 0.3750000   0.6250000
    3:  Elopement 0.6309148   0.3690852
    4:       Pica 0.3891089   0.6108911
    5: Stereotypy 0.2698962   0.7301038
    6: Disruptive 0.4063927   0.5936073
    

    Input:

    dt <- structure(list(Event_Time = c("Morning", "Afternoon", "Morning", 
    "Afternoon", "Morning", "Afternoon", "Morning", "Afternoon", 
    "Morning", "Afternoon", "Morning", "Afternoon"), Event_Name = c("Aggression", 
    "Aggression", "SIB", "SIB", "Elopement", "Elopement", "Pica", 
    "Pica", "Stereotypy", "Stereotypy", "Disruptive", "Disruptive"
    ), Num_of_Occurances = c(3468L, 4658L, 900L, 1500L, 400L, 234L, 
    786L, 1234L, 234L, 633L, 534L, 780L)), row.names = c(NA, -12L
    ), class = "data.frame")
    

    Of course, you don't need the function. Here is an alternative without a helper function, this time using dplyr instead of data.table:

    library(dplyr)
    
    # reframe() and the .by argument need dplyr >= 1.1.0
    reframe(dt, p_morning=Num_of_Occurances[Event_Time=="Morning"]/sum(Num_of_Occurances), .by=Event_Name) %>% 
      mutate(p_afternoon = 1-p_morning)
    

    Output:

      Event_Name p_morning p_afternoon
    1 Aggression 0.4267782   0.5732218
    2        SIB 0.3750000   0.6250000
    3  Elopement 0.6309148   0.3690852
    4       Pica 0.3891089   0.6108911
    5 Stereotypy 0.2698962   0.7301038
    6 Disruptive 0.4063927   0.5936073
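
Since the question asks specifically about prop.table, here is a minimal base R sketch of that route. It assumes the same data as above; the data frame is rebuilt inside the block (with rep() rather than the structure() call) only so that the snippet runs on its own, and the names counts and probs are chosen for illustration.

```r
# Assumption: same data as the question's table, rebuilt so this block is self-contained
dt <- data.frame(
  Event_Time = rep(c("Morning", "Afternoon"), times = 6),
  Event_Name = rep(c("Aggression", "SIB", "Elopement",
                     "Pica", "Stereotypy", "Disruptive"), each = 2),
  Num_of_Occurances = c(3468L, 4658L, 900L, 1500L, 400L, 234L,
                        786L, 1234L, 234L, 633L, 534L, 780L)
)

# Cross-tabulate the counts: one row per event, one column per time of day
counts <- xtabs(Num_of_Occurances ~ Event_Name + Event_Time, data = dt)

# margin = 1 divides each cell by its row total, so every row sums to 1
probs <- prop.table(counts, margin = 1)
round(probs, 4)
```

Note that xtabs orders rows and columns by factor level (alphabetically for character columns), so the events will not appear in the original row order and the Afternoon column comes before Morning. The values themselves should match the p_morning and p_afternoon columns above.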