Search code examples
rdataframesortingsum

Summing ranks for variable with fewest entries


I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.

library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
  Alaska  = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA), 
  Calif   = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>% 
  pivot_longer(
    cols = c("Alaska", "Calif"),
    names_to = "State",
    values_to = "Value",
    values_drop_na = TRUE
  ) %>% 
  mutate(StateRank = rank(Value, ties.method = "average"))

# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2])>as.numeric(asc_or_desc[2,2])) {
  a.df <- arrange(a.df, desc(State))
} else {
  a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------

nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))

a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2               # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12           # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2), 
                                mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that 
# the answer is the same as in the textbook that I use)

Solution

  • Please try this code and tell me if I am on the right way:

    I replaced your so called clumsy code with this one

      ...  %>%
      group_by(State) %>%  
      mutate(mx = max(Value)) %>% 
      arrange(desc(mx), desc(Value)) %>% 
      select(-mx)
    

    The whole code:

    library(tidyverse)
    # Activity 9: aboriginal village size in Alaska and California
    a.df <- data.frame(
      Alaska  = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA), 
      Calif   = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
    ) %>% 
      pivot_longer(
        cols = c("Alaska", "Calif"),
        names_to = "State",
        values_to = "Value",
        values_drop_na = TRUE
      ) %>% 
      mutate(StateRank = rank(Value, ties.method = "average")) %>% 
      group_by(State) %>%  
      mutate(mx = max(Value)) %>% 
      arrange(desc(mx), desc(Value)) %>% 
      select(-mx)
      
    -----------------------------------------------------------------------------
    
    a.U <- sum(a.df$StateRank[1:nA])
    a.E <- (nA*(nA+nB+1))/2               # Expectation of U
    a.V <- (nA*nB*(nA+nB+1))/12           # Variance of U
    a.Z <- (a.U - a.E)/sqrt(a.V)
    a.P <- round((1 - round(pnorm(round(abs(a.Z), 2), 
                                    mean = 0, sd = 1) ,4)) * 2, 3)
    # all the rounding is to mimic statistical tables (so that 
    # the answer is the same as in the textbook that I use)