R: Change '5-10' character to 7.5 numeric in dataframe row, NAs introduced by coercion


I have survey data on the number of minutes needed to complete a journey, stored as a character column in a dataframe. Some entries are written as a range, e.g. '5-10'. I want to replace these entries with the mean of the range.

My data looks like this.

 [1] "30"                         "15"                        
 [3] "30"                         "15 Minutes "               
 [5] "15"                         "20 mins "                  
 [7] "30"                         "half an hour to 40 minutes"
 [9] "30"                         "40"                        
[11] "20"                         "30"                        
[13] "15"                         "20"                        
[15] "40"                         "20"                        
[17] "40"                         "30"                        
[19] "15"                         "15"                        
[21] "20"                         "30mins"                    
[23] "20"                         "20"                        
[25] "15"                         "40"                        
[27] "15"                         "25"                        
[29] "30"                         "20"                        
[31] "Depends where you live"     "30-45"                     
[33] "30 min "                    "20"                        
[35] "30"                         "20"                        
[37] "60 minutes"                 "30 mins"                   
[39] "15"                         "10"                        
[41] "20"                         "40"                        
[43] "60"                         "60"                        
[45] "30"                         "49"                        
[47] "50 minutes"                 "20 minutes"                
[49] "90"                         "7-10 minutes "             
[51] "15-20"                      "25 minutes"                
[53] "25"                         "45"                        
[55] "60 minutes "                "2-4 hours"                 
[57] "30"                         "30 min"                    
[59] "20"                         "30"                        
[61] "20"                         "25"                        
[63] "2-4hrs"                     "30"                        
[65] "45"                         "45"                        
[67] "75"                         "20"                        
[69] "60"                         "45mins"                    
[71] "60"                         "20"

I have tried the following code:

# Attempt 1: case_when
data <- data %>% mutate(
  est_time = case_when(
    grepl('-', est_time) ~ mean(as.numeric(unlist(str_split(est_time, '-'))))
  )
)

# Attempt 2: ifelse
data <- data %>% mutate(
  est_time = ifelse(
    grepl('-', est_time),
    mean(as.numeric(unlist(str_split(est_time, '-')))),
    est_time
  )
)

Each time, I receive:

Warning message:
Problem while computing `est_time = case_when(...)`.
ℹ NAs introduced by coercion 

I suspect this may be because the unlist function spreads the list data over multiple rows.

How can I resolve this and achieve my aim?


Solution

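  • Instead of using case_when or ifelse, one option is to select only the entries that consist purely of a range (digits, -, digits), read them with read.table using - as the separator, take the rowMeans, and assign the result back. Note that est_time remains a character column afterwards, because the other entries are not numeric.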

    i1 <- grepl('^(\\d+)-(\\d+)$', data$est_time)
    data$est_time[i1] <- rowMeans(read.table(text = data$est_time[i1], 
         sep = '-', header = FALSE), na.rm = TRUE)
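
    As for why the original attempts warn: both case_when and ifelse evaluate their right-hand side on the whole column, so unlist(str_split(est_time, '-')) flattens the pieces of every row into one vector. mean() then returns a single value for the entire column, and pieces such as "10 minutes " become NA under as.numeric(). A minimal base R sketch of the difference, using a made-up vector x (an assumption for illustration, not the survey data):

    ```r
    # Hypothetical example vector; only pure "a-b" ranges are averaged here
    x <- c("30", "5-10", "15-20", "7-10 minutes ")

    # Whole-vector version: every piece of every element ends up in one vector,
    # and "10 minutes " cannot be coerced to a number
    pieces_all <- suppressWarnings(
      as.numeric(unlist(strsplit(x, "-", fixed = TRUE)))
    )
    length(pieces_all)  # more pieces than elements in x
    anyNA(pieces_all)   # the non-numeric piece became NA

    # Per-element version: split each string, then average its own pieces
    is_range <- grepl("^\\d+-\\d+$", x)
    range_means <- vapply(
      strsplit(x[is_range], "-", fixed = TRUE),
      function(p) mean(as.numeric(p)),
      numeric(1)
    )
    range_means
    ```

    The per-element version is equivalent in spirit to the read.table/rowMeans approach above: each range is averaged on its own rather than pooled with the rest of the column.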
    

    If we instead want to take the mean for every entry that contains a range (i.e. including entries like 2-4 hours or 7-10 minutes in addition to 15-20):

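    library(stringr)
    library(dplyr)
    # Replace each "a-b" match with its mean. The helper is vectorised and
    # returns character, so it works whether stringr calls it once per match
    # (older versions) or once with all matches (assumed for newer versions).
    range_mean <- function(x) {
      vapply(strsplit(x, "-", fixed = TRUE),
             function(p) as.character(mean(as.numeric(p))), character(1))
    }
    data %>%
      mutate(est_time2 = str_replace_all(est_time, "\\d+-\\d+", range_mean))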
    

    Output:

    # A tibble: 9 × 2
      est_time                 est_time2               
      <chr>                    <chr>                   
    1 "15 Minutes"             "15 Minutes"            
    2 "20 mins"                "20 mins"               
    3 "40"                     "40"                    
    4 "15"                     "15"                    
    5 "Depends where you live" "Depends where you live"
    6 "7-10 minutes "          "8.5 minutes "          
    7 "15-20"                  "17.5"                  
    8 "2-4 hours"              "3 hours"               
    9 "30 min"                 "30 min"    
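
    Note that est_time2 is still a character column. If a numeric column is the end goal, one possible follow-up is to extract the first number from each entry. This is only a sketch under a loud assumption: units are ignored, so "3 hours" becomes 3 rather than 180, and entries with no digits become NA. The names first_num and est_minutes are made up for the example.

    ```r
    # Values copied from the est_time2 column in the output above
    est_time2 <- c("15 Minutes", "20 mins", "40", "15", "Depends where you live",
                   "8.5 minutes ", "17.5", "3 hours", "30 min")

    # Locate the first number (with an optional decimal part) in each string;
    # regexpr() returns -1 where there is no match
    hit <- regexpr("[0-9]+(\\.[0-9]+)?", est_time2)

    # Start from NA and fill in only the entries that contain a number
    first_num <- rep(NA_character_, length(est_time2))
    first_num[hit > 0] <- regmatches(est_time2, hit)

    # Assumption: units are NOT converted, so hours are not turned into minutes
    est_minutes <- as.numeric(first_num)
    est_minutes
    ```

    Entries given in hours would still need separate handling (for example, multiplying by 60 where the text mentions hours) before the column can be read as minutes.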
    

    Data:

    data <- structure(list(est_time = c("15 Minutes", "20 mins", "40", "15", 
    "Depends where you live", "7-10 minutes ", "15-20", "2-4 hours", 
    "30 min")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -9L))