Search code examples
rstringmultiple-columnsdata-cleaning

How to delete text that does not start with a certain amount of numbers in dataframe column


I have this:

col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US

And I need it to be this:

col1     col2    col3
1234HO   1234    HO
9535KU   9535    KU
4532SP   4532    SP
                       #these rows still need to be there


1206     1206          #keep in mind that I still want to keep this if there is 4 numbers
9530OK   9530    OK

8524US   8524    US

I tried removing it manually, but it's a bit too much work. I am not sure how to make a function that says "delete all text that does not start with 4 numbers". I would only know how to do it if they were all the same numbers, but they can be any numbers.


Solution

  • Here is a dplyr approach with very low level of regex.

    Input

    # your df
    df <- read.table(header = T, text = "
    col1
    1234HO
    9535KU
    4532SP
    1
    hello
    xyz
    1206
    9530OK
    23
    8524US")
    

    Empty rows

    library(dplyr)
    
    df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"), 
                  col3 = str_extract(col1, "[A-Z].*$"),
                  col3 = replace_na(col3, ""),
                  across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, print(""))))
    
         col1 col2 col3
    1  1234HO 1234   HO
    2  9535KU 9535   KU
    3  4532SP 4532   SP
    4                  
    5                  
    6                  
    7    1206 1206     
    8  9530OK 9530   OK
    9                  
    10 8524US 8524   US
    

    Rows as NA

    # if you want them to be filled with NA
    df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"), 
                  col3 = str_extract(col1, "[A-Z].*$"),
                  across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, NA)))
    
         col1 col2 col3
    1  1234HO 1234   HO
    2  9535KU 9535   KU
    3  4532SP 4532   SP
    4    <NA> <NA> <NA>
    5    <NA> <NA> <NA>
    6    <NA> <NA> <NA>
    7    1206 1206 <NA>
    8  9530OK 9530   OK
    9    <NA> <NA> <NA>
    10 8524US 8524   US