Search code examples
rstringnadata-cleaningpostal-code

How replace all cases in columns with NA if there are more than x numbers OR more than x letters in the string?


I have data that looks a bit like this

col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"

I want to keep everything that

  • is only 4 numbers
  • is only 4 numbers and two letters with any combination of spaces

I currently have this code:

df$col2 <- gsub('\\s+', '', df$col2)
df$col2 <- toupper(df$col2)
#Delete all rows that does not start with 4 numbers and make PC4 column
df <- df %>% 
  mutate(col3 = str_extract(col2, "^[0-9]{4,}"), 
         col4 = str_extract(col2, "[A-Z].*$"),
         across(c(col2,col3,col4), ~ifelse(grepl("^[0-9]{4}", col2), .x, "")))

I want this result:

col1    col2       col3   col4
1       "1042AZ"   1042   "AZ"
2       "9523PA"   9523   "PA"
3       NA         NA     NA
4       NA         NA     NA
5       NA         NA     NA
6       NA         NA     NA
7       "1052"     1052   NA

Problem is that the number in line 5 stays after running my code.


Solution

  • Following your code, you can set to NA if col3 does not have 4 characters:

    df %>% 
      mutate(col2 = gsub('\\s+', '', toupper(col2)),
             col3 = str_extract(col2, "^[0-9]{4,}"), 
             col4 = str_extract(col2, "[A-Z|a-z].*$"),
             across(c(col2,col3,col4), ~ ifelse(nchar(col3) == 4, .x, NA)))
    
      col1   col2 col3 col4
    1    1 1042AZ 1042   AZ
    2    2 9523PA 9523   PA
    3    3   <NA> <NA> <NA>
    4    4   <NA> <NA> <NA>
    5    5   <NA> <NA> <NA>
    6    6   <NA> <NA> <NA>
    7    7   1052 1052 <NA>
    

    data

    df <- read.table(header = T, text = 'col1   col2
    1      "1042AZ"
    2      "9523 pa"
    3      "dog"
    4      "New York"
    5      "20000 (usa)"
    6      "Outside the country"
    7      "1052"')