I need help extracting the comma-separated numbers stored as a string in each element of a column in my data frame, and replacing that string with their median. For example:
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b,a)
Now I need to replace variable a with the median of the numbers contained in each element, so that it looks like this:
       b a
1 Karina 4
2    Eva 6
3   Jake 4
4    Ana 6
A little background: each number is the length of a word that belongs to the corresponding name. I need to find the median length for each name and figure out whether names that start with a vowel have a longer median length or not. For example, from the above I would conclude that names that start with a vowel (Eva, Ana) have a longer median length. I would then like to use a test to show whether that difference is statistically significant. If someone can guide me in any way, I would really appreciate it!
We can split the 'a' column with strsplit on a comma followed by zero or more spaces (",\\s*"), loop over the resulting list with sapply, convert each element to numeric, take the median, and assign the result back to the same column:
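# as.character() guards against 'a' being a factor (the data.frame() default before R 4.0)
df$a <- sapply(strsplit(as.character(df$a), ",\\s*"), function(x) median(as.numeric(x)))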
df$a
#[1] 4 6 4 6
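To make the intermediate steps visible, here is a minimal sketch of the same base R approach split into two stages. It rebuilds the example data from the question. The name parts is only illustrative, and vapply is swapped in for sapply as an assumption on my side, so that R raises an error if any element fails to reduce to a single number.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps 'a' as character
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b, a, stringsAsFactors = FALSE)

# Step 1: split each string on a comma followed by optional whitespace.
# 'parts' (an illustrative name) is a list with one character vector per row.
parts <- strsplit(df$a, ",\\s*")

# Step 2: convert each vector to numeric and take its median.
# vapply() insists that every result is exactly one numeric value.
df$a <- vapply(parts, function(x) median(as.numeric(x)), numeric(1))

df
```

Splitting the call this way also makes it easy to inspect parts when a row contains something unexpected, such as an empty string or a non-numeric token, which as.numeric would turn into NA.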
Or, using the tidyverse, we can use separate_rows to split the 'a' column and expand it into one row per number while converting the type, then group by 'b' and take the median. The result has one row per name, sorted by 'b':
library(dplyr)
library(tidyr)
df %>%
  separate_rows(a, convert = TRUE) %>%
  group_by(b) %>%
  summarise(a = median(a))
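For the follow-up part of the question (vowel versus consonant names), one possible sketch is shown below. It assumes the per-name medians have already been computed, so the data frame is rebuilt here with the numeric values from the expected output. The column name vowel is hypothetical, and the Wilcoxon rank-sum test is only one reasonable choice for comparing two small groups, not necessarily the right test for the real data.

```r
# Per-name medians as in the expected output of the question
df <- data.frame(
  b = c("Karina", "Eva", "Jake", "Ana"),
  a = c(4, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Flag names whose first letter is a vowel (case-insensitive);
# 'vowel' is a hypothetical column name introduced for this sketch
df$vowel <- grepl("^[aeiou]", df$b, ignore.case = TRUE)

# Median of the per-name medians within each group
group_medians <- tapply(df$a, df$vowel, median)
group_medians

# Rank-based two-sample comparison. exact = FALSE requests the normal
# approximation, because exact p-values are not available with ties.
res <- wilcox.test(a ~ vowel, data = df, exact = FALSE)
res$p.value
```

With only four names the test has almost no power, so this is meant to show the mechanics; on the full data set the same calls apply unchanged.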