I have the following dataset:
id = 1:5
col1 = c("john", "henry", "adam", "jenna", "peter")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113", "Phone: 444 111 1153", "Phone: 111 111 1121")
my_data = data.frame(id, col1, col2, col3, col4)
  id  col1             col2               col3                col4
1  1  john    river B8C 9L4               matt Phone: 111 1111 111
2  2 henry Field U9H 5E2 PP              steve     Phone: 222 2222
3  3  adam               NA forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4 jenna ocean A1B 5H1 dd                 NA Phone: 444 111 1153
5  5 peter             dave                 NA Phone: 111 111 1121
I found this regex code that recognizes a pattern of three letter-digit pairs (for example "B8C 9L4"); it can then be wrapped into a function:
apply(my_data, 1, function(x) gsub('(([A-Z] ?[0-9]){3})|.', '\\1', toString(x)))
[1] "B8C 9L4" "U9H 5E2" "K0Y 1U9" "A1B 5H1" ""
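For illustration, here is a minimal sketch of how that wrapping might look on single strings. The name extract_code is hypothetical and not part of the original code; the regex itself is unchanged.

```r
# Hypothetical wrapper around the regex from the question.
extract_code <- function(x) {
  # Group 1 captures three letter-digit pairs (optional space inside a pair).
  # The alternative "." matches any other single character; for those matches
  # group 1 is empty, so the character is replaced by nothing.
  gsub("(([A-Z] ?[0-9]){3})|.", "\\1", x)
}

extract_code("Field U9H 5E2 PP")  # "U9H 5E2"
extract_code("dave")              # ""
```

Because every non-matching character is deleted, this form can only ever return the code itself, never the text around it.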
Once this has been done, is there any way to extend this code such that it returns the entire contents of the cell in which the pattern was found?
For example, the result would then look like this:
[1] "river B8C 9L4" "Field U9H 5E2 PP" "forest K0Y 1U9 hu2" "ocean A1B 5H1 dd"
One option is to loop over the rows, drop the elements that are "NA" or that contain the substring "Phone", and then keep the first remaining element that has more than one word (counted with str_count). Rows with no such element return NA, which na.omit removes.
library(stringr)
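na.omit(apply(my_data[-1], 1, \(x) {
  x <- x[x != "NA"]                  # drop the literal "NA" strings
  x1 <- x[!str_detect(x, "Phone")]   # drop the phone-number cells
  x1[str_count(x1, "\\w+") > 1][1]   # first remaining cell with more than one word
}))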
-output
[1] "river B8C 9L4" "Field U9H 5E2 PP"
[3] "forest K0Y 1U9 hu2" "ocean A1B 5H1 dd"
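The approach above relies on word counts rather than on the pattern itself. As an alternative, here is a minimal base R sketch that reuses the regex from the question and returns the whole cell in which it matches. The helper name first_match_cell is hypothetical, and the sketch assumes at most one cell per row is of interest.

```r
# Example data from the question ("NA" is a literal string here).
my_data <- data.frame(
  id   = 1:5,
  col1 = c("john", "henry", "adam", "jenna", "peter"),
  col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave"),
  col3 = c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA"),
  col4 = c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113",
           "Phone: 444 111 1153", "Phone: 111 111 1121")
)

# Pattern from the question: three letter-digit pairs.
pattern <- "([A-Z] ?[0-9]){3}"

# Hypothetical helper: first cell of a row that contains the pattern, else NA.
first_match_cell <- function(x) {
  hits <- x[grepl(pattern, x)]
  if (length(hits) > 0) hits[[1]] else NA_character_
}

res <- apply(my_data[-1], 1, first_match_cell)
res <- unname(res[!is.na(res)])  # drop rows without a match
res
```

On the example data this gives the same four strings as the output above, and it does not depend on the "Phone" prefix or on the number of words in a cell.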