I have this:
col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US
And I need it to be this:
col1 col2 col3
1234HO 1234 HO
9535KU 9535 KU
4532SP 4532 SP
#these rows still need to be there
1206 1206 #keep in mind that I still want to keep this if there is 4 numbers
9530OK 9530 OK
8524US 8524 US
I tried removing it manually, but it's a bit too much work. I am not sure how to make a function that says "delete all text that does not start with 4 numbers". I would only know how to do it if they were all the same numbers, but they can be any numbers.
Here is a dplyr
approach with very low level of regex.
# your df
df <- read.table(header = T, text = "
col1
1234HO
9535KU
4532SP
1
hello
xyz
1206
9530OK
23
8524US")
library(dplyr)
df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"),
col3 = str_extract(col1, "[A-Z].*$"),
col3 = replace_na(col3, ""),
across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, print(""))))
col1 col2 col3
1 1234HO 1234 HO
2 9535KU 9535 KU
3 4532SP 4532 SP
4
5
6
7 1206 1206
8 9530OK 9530 OK
9
10 8524US 8524 US
# if you want them to be filled with NA
df %>% mutate(col2 = str_extract(col1, "^[0-9]{4,}"),
col3 = str_extract(col1, "[A-Z].*$"),
across(everything(), ~ifelse(grepl("^[0-9]{4}", col1), .x, NA)))
col1 col2 col3
1 1234HO 1234 HO
2 9535KU 9535 KU
3 4532SP 4532 SP
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
7 1206 1206 <NA>
8 9530OK 9530 OK
9 <NA> <NA> <NA>
10 8524US 8524 US