Search code examples
rdata-manipulation

Scanning and Replacing Values of Rows in R


I have this dataset:

sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA",  "125 ocean road") )

   col1            col2              col3           col4
 james 123 forest road                NA             NA
 john           jason 124 valley street             NA
 henry             tim             peter 125 ocean road

I want to try and figure out a way in which the second column always contains the "address" - the final product would look this would look something like this:

# code to show sample of desired result
 desired_result = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "124 valley street", "125 ocean road"))

   col1              col2
 james   123 forest road
  john 124 valley street
 henry    125 ocean road

I have been trying to think of and research functions in R that are able to "scan" if the value contained within a row/column starts with a number, and make a decision accordingly.

I had the following idea - I can check to see if a given column starts with a number or not:

sample_data$is_col2_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col2,1,1))
sample_data$is_col3_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col3,1,1))
sample_data$is_col4_a_number = grepl("^[0-9]{1,}$", substr(sample_data$col4,1,1))

   col1            col2              col3           col4 is_col2_a_number is_col3_a_number is_col4_a_number
1 james 123 forest road                NA             NA             TRUE            FALSE            FALSE
2  john           jason 124 valley street             NA            FALSE             TRUE            FALSE
3 henry             tim             peter 125 ocean road            FALSE            FALSE             TRUE

Next, I would try to figure out how to code the following logic:

  • For a given row, find the first cell that contains the value TRUE
  • Keep the column corresponding to that condition

I tried this row-by-row:

first_row = sample_data[1,]

ifelse(first_row$is_col2_a_number == "TRUE", first_row[,c(1,2)], ifelse(first_row$is_col3_a_number, first_row[, c(1,3)], first_row[, c(1,4)]))

But I think I have made this unnecessarily complicated. Can someone please give me a hand and suggest how I can continue solving this problem?

Thank you!


Solution

  • This should work:

    library(dplyr)
    library(tidyr)
    library(stringr)
    sample_data = data.frame(col1 = c("james", "john", "henry"), col2 = c("123 forest road", "jason", "tim"), col3 = c("NA", "124 valley street", "peter"), col4 = c("NA", "NA",  "125 ocean road") )
    
    tmp <- sample_data %>% 
      mutate(across(col2:col4, ~case_when(str_detect(.x, "^\\d") ~ .x, 
                                          TRUE ~ NA_character_)), 
      address = coalesce(col2, col3, col4)) %>% 
      select(col1, address)
    tmp
    #>    col1           address
    #> 1 james   123 forest road
    #> 2  john 124 valley street
    #> 3 henry    125 ocean road
    

    Created on 2022-06-30 by the reprex package (v2.0.1)