Tags: r, dataframe, dplyr, frequency

R code - Why is my frequency table giving me wrong percentage numbers? I have reproducible code below


I have the data frame below. The values are frequency counts:

pnb3 <- structure(list(Likelihood.to.Click.Freq = c(29L, 71L, 120L), 
    Likelihood.to.Enroll.Freq = c(30L, 84L, 106L), Likelihood.to.Click.1.Freq = c(54L, 
    90L, 108L), Likelihood.to.Enroll.1.Freq = c(55L, 109L, 88L
    ), Likelihood.to.Click_0.Freq = c(50L, 77L, 86L), Likelihood.to.Enroll_0.Freq = c(49L, 
    93L, 71L), Likelihood.to.Click_1.Freq = c(25L, 63L, 163L), 
    Likelihood.to.Enroll._0.Freq = c(26L, 90L, 135L), Likelihood.to.Click_2.Freq = c(63L, 
    74L, 94L), Likelihood.to.Enroll_1.Freq = c(61L, 95L, 75L), 
    Likelihood.to.Click_3.Freq = c(22L, 51L, 157L), Likelihood.to.Enroll._1.Freq = c(24L, 
    93L, 113L), Likelihood.to.Click_4.Freq = c(42L, 66L, 118L
    ), Likelihood.to.Enroll._2.Freq = c(39L, 90L, 97L), Likelihood.to.Click_5.Freq = c(25L, 
    47L, 157L), Likelihood.to.Enroll_2.Freq = c(26L, 75L, 128L
    ), Likelihood.to.Click_6.Freq = c(42L, 84L, 96L), Likelihood.to.Enroll_3.Freq = c(38L, 
    103L, 81L), Likelihood.to.Click_7.Freq = c(30L, 69L, 105L
    ), Likelihood.to.Enroll_4.Freq = c(28L, 88L, 88L), Likelihood.to.Click_8.Freq = c(29L, 
    57L, 140L), Likelihood.to.Enroll_5.Freq = c(27L, 90L, 109L
    ), Likelihood.to.Click_9.Freq = c(40L, 70L, 109L), Likelihood.to.Enroll_6.Freq = c(34L, 
    94L, 91L), Likelihood.to.Click_10.Freq = c(31L, 75L, 135L
    ), Likelihood.to.Enroll_7.Freq = c(32L, 93L, 116L)), class = "data.frame", row.names = c(NA, 
-3L))

But when I try to convert the counts to percentages, the last row is incorrect. It should be ~54/55 percent, but I am getting ~47/48 percent. I don't think it's a rounding error, as it's off by quite a bit. Basically, in each set of outputs one number comes out incorrect.

Here is the code I use to change the frequency counts to percentages. Is there anything wrong with it? I know there are ways to do this with a function, but I wanted to break it down to see each step:

pnb4 <- pnb3 / (colSums(pnb3))
pnb5 <- pnb4 *100
pnb6 <- round(pnb5,1)

If you run it you'll notice the third % is off by quite a bit.

UPDATE: for example, once I run the above, the first output gives me this:

[screenshot of the output; image not preserved]

but the third row should actually be about 54.5% (because 120/220 ≈ 54.5%)


Solution

  • The problem is that your code isn't vectorized the way you expect. When you divide a data frame by a vector, R recycles the vector element by element down each column instead of matching one value to each column. So your code takes the first value of column 1 and divides it by the colSum for column 1. Then it takes the second row of column 1 and divides it by the colSum for column 2 (which still gives the right answer, because both colSums are 220). But when it gets to the third row, it divides by the colSum for column 3 (i.e. 252), and that is not correct.

    You can do:

    library(dplyr)
    pnb3 %>%
      mutate(across(everything(), ~round(./sum(.)*100, 1)))
    

    Here's the result for the first few columns:

    # A tibble: 3 x 26
      Likelihood.to.C~ Likelihood.to.E~ Likelihood.to.C~ Likelihood.to.E~
                 <dbl>            <dbl>            <dbl>            <dbl>
    1             13.2             13.6             21.4             21.8
    2             32.3             38.2             35.7             43.3
    3             54.5             48.2             42.9             34.9
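
If you would rather stay in base R, the recycling issue described above can be reproduced and avoided along these lines. This is only a sketch: it uses the first three columns of pnb3 with shortened names (click, enroll, click1) that I made up for readability, and sweep() and prop.table() are just two of several ways to divide each column by its own total.

```r
# First three columns of pnb3, with shortened (hypothetical) column names
df <- data.frame(
  click  = c(29L, 71L, 120L),
  enroll = c(30L, 84L, 106L),
  click1 = c(54L, 90L, 108L)
)
totals <- colSums(df)  # 220, 220, 252

# Original approach: totals is recycled down each column, so the
# third row of click is divided by 252 instead of 220
wrong <- df / totals * 100

# sweep() with MARGIN = 2 divides every column by its own total
m <- as.matrix(df)
right_sweep <- round(sweep(m, 2, totals, "/") * 100, 1)

# prop.table() with margin = 2 computes column proportions directly
right_prop <- round(prop.table(m, margin = 2) * 100, 1)

print(round(wrong, 1))
print(right_sweep)
```

Both corrected versions should agree with the dplyr result for these columns. Working on a matrix is a choice made here to keep the column-wise arithmetic explicit; as.data.frame() converts the result back if you need a data frame.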