r probability

The probability of a variable occurring more than once in a dataset


I am working on a dataset with roughly 38,000 observations of trending YouTube videos. A single video can have multiple observations, meaning that it can trend more than once or for longer than one day.

I know the above is true, but I am trying to figure out how to calculate the probability of a video being observed more than once in this dataset, i.e. P(X > 1), where X is the number of observations of a video.

Refer to the image below, which I plotted with barplot(head(table(df$video_id))): [image: bar plot of observation counts for the first 6 video_id values]

We can tell that out of these 6 videos, 5 have more than one observation, which gives a probability of 83.33%. How can I compute the same for the whole dataset? I am not necessarily trying to visualize it (though that would be a bonus); I am simply curious how to calculate the probability of a video_id occurring more than once in the ~38,000 observations.

Here's a sample of 20 observations: https://pastebin.com/Tx9ebH2c


Solution

  • You have most of what you need:

    tbl <- table(df$video_id)
    p <- sum(tbl > 1)/length(tbl)
    p
    # [1] 0.5
    

    For your sample data set, half of the videos occur more than once. The length of the table is the number of distinct videos, so dividing by it gives the proportion of videos observed more than once. You could draw a simple bar plot comparing the proportion of videos observed more than once with the proportion observed only once.

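    # Label the bars so it is clear which proportion is which
    barplot(c(p, 1 - p), names.arg = c("More than once", "Once"))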
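
If it helps to see the calculation end to end, here is a minimal self-contained sketch. The toy data frame is made up for illustration and is not the asker's data; only the column name video_id comes from the question. It also shows a second, row-level proportion (p_row, a name introduced here), since "the probability of a video being observed more than once" could be read per distinct video or per observation, and the two generally differ.

```r
# Hypothetical toy data standing in for df (the real data has ~38,000 rows)
df <- data.frame(
  video_id = c("a", "a", "b", "c", "c", "c", "d"),
  stringsAsFactors = FALSE
)

# Number of observations per distinct video
tbl <- table(df$video_id)

# Share of distinct videos observed more than once.
# mean() of a logical vector is the proportion of TRUE values,
# so this is equivalent to sum(tbl > 1) / length(tbl).
p_video <- mean(tbl > 1)
p_video
# [1] 0.5

# Share of rows (observations) that belong to a repeated video
repeated_ids <- names(tbl)[tbl > 1]
p_row <- mean(df$video_id %in% repeated_ids)
p_row
# 5 of the 7 rows, about 0.71
```

Which of the two numbers answers the question depends on whether the unit of interest is a video or a single trending observation; the answer above uses the per-video version.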