Replace strings of numbers separated by commas with the median in R


I need help extracting the comma-separated numbers stored as a string in each element of a column in my data frame, and replacing each string with the median of its numbers. For example:

a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b,a)

Now I need to replace variable a with the median of the numbers contained in each element, so that it looks like this:

        b    a
1 Karina     4
2 Eva        6
3 Jake       4
4 Ana        6

A little background: each number is the length of a word that belongs to the corresponding name. I need to find the median word length for each name and figure out whether names that start with a vowel have a longer median length or not. For example, from the output above I would conclude that names starting with a vowel (Eva, Ana) have a longer median length. I would then like to use a test to show whether the difference is statistically significant. If someone can guide me in any way, I would really appreciate it!


Solution

  • We can split the 'a' column with strsplit on a comma followed by zero or more whitespace characters (",\\s*"), loop over the resulting list with sapply, convert each element to numeric, take its median, and assign the result back to the same column

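    # as.character() guards against 'a' being a factor (the data.frame() default before R 4.0)
    df$a <- sapply(strsplit(as.character(df$a), ",\\s*"), function(x) median(as.numeric(x)))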
    df$a
    #[1] 4 6 4 6
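
To see what each step of the one-liner contributes, it can be unpacked as below. This is only a sketch on the question's data: the intermediate names `parts`, `nums` and `meds` are made up for illustration, and `vapply` is swapped in for `sapply` so that the result is guaranteed to be a numeric vector.

```r
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")

# Step 1: split each string on a comma plus optional whitespace
parts <- strsplit(a, ",\\s*")
parts[[4]]
# [1] "5" "7"

# Step 2: convert each character vector to numeric
nums <- lapply(parts, as.numeric)

# Step 3: one median per element; numeric(1) enforces a single number each
meds <- vapply(nums, median, numeric(1))
meds
# [1] 4 6 4 6
```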
    

    Or, using the tidyverse, we can use separate_rows to split the 'a' column and expand the rows while converting the type, then group by 'b' and take the median

    library(dplyr)
    library(tidyr)
    df %>% 
         separate_rows(a, convert = TRUE) %>%
         group_by(b) %>% 
         summarise(a = median(a))
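
For the second part of the question (vowel versus non-vowel names), one possible approach is sketched below. It assumes the medians have already been computed as above; the `vowel` column and the choice of a Wilcoxon rank-sum test are assumptions for illustration, not something the question prescribes. With only four names the test has essentially no power, so treat this as a template for the full data.

```r
# Toy data from the question, with the medians already computed
df <- data.frame(
  b = c("Karina", "Eva", "Jake", "Ana"),
  a = c(4, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Flag names whose first letter is a vowel (case-insensitive)
df$vowel <- grepl("^[aeiou]", df$b, ignore.case = TRUE)

# Median length within each group
aggregate(a ~ vowel, data = df, FUN = median)

# Wilcoxon rank-sum test comparing the two groups;
# exact = FALSE uses the normal approximation, since ties rule out an exact p-value
res <- wilcox.test(a ~ vowel, data = df, exact = FALSE)
res$p.value
```

A rank-based test is used here because the values being compared are medians of small integer counts, which are unlikely to be normally distributed; a t-test would be the alternative if that assumption were reasonable for the real data.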