How can I extract characters from one column and add them to another existing column in R?

I would like to extract information from one column and use it to replace values in another existing column. I have a variable, id, that identifies the country and year (country_year), and I split it into two columns, Country and year. For example:

id           Country    year   
AUS_1999     AUS        1999    
CAN_1999     CAN        1999    
AUS_2000     AUS        2000     
CAN_2000     CAN        2000    
BELS1999     BELS1999   NA

In the example, notice that the fifth observation has no "_" in its id, so the code that I used to split the id column did not separate it. It ended up with a missing value in the year column, and the Country column is also wrong. There are a few of these observations in my data frame. How can I correct all of them by extracting the information from the id column and adding it to the columns that I already created (Country and year)?

I tried to be as clear as possible, let me know if you need more information.

Solution

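We could use a regex lookaround to separate the column 'id'. The pattern splits either on "_" or, where the "_" is missing, at the zero-width position between an uppercase letter and a digit: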

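library(dplyr)
library(tidyr)
# split on "_" or at the boundary between an uppercase letter and a digit;
# remove = FALSE keeps the original id column
df1 %>%
  separate(id, into = c("Country", "year"),
           sep = "_|(?<=[A-Z])(?=\\d)", remove = FALSE)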

-output

        id Country year
1 AUS_1999     AUS 1999
2 CAN_1999     CAN 1999
3 AUS_2000     AUS 2000
4 CAN_2000     CAN 2000
5 BELS1999    BELS 1999
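
To see where that sep pattern splits, one option is to mark every match with a visible character. This is only an illustrative sketch in base R: the vector ids and the "|" marker are made up for the demonstration, and perl = TRUE is used because lookarounds are not supported by R's default regex engine.

```r
# hypothetical sample ids, one with and one without the "_" separator
ids <- c("AUS_1999", "BELS1999")

# replace each split point with "|" to make it visible
marked <- gsub("_|(?<=[A-Z])(?=\\d)", "|", ids, perl = TRUE)
print(marked)
```

Each id should end up with exactly one marker, sitting between the letters and the digits.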

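Or with extract, which captures the letters and the digits as two groups and treats the "_" between them as optional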

df1 %>% 
  extract(id, into = c("Country", "year"), "^([A-Z]+)_?(\\d+)", remove = FALSE)

-output

        id Country year
1 AUS_1999     AUS 1999
2 CAN_1999     CAN 1999
3 AUS_2000     AUS 2000
4 CAN_2000     CAN 2000
5 BELS1999    BELS 1999

Or in base R, insert a "_" between the uppercase letter and the digit where it is missing, then read the result with read.table into two columns

cbind(df1, read.table(text = sub("([A-Z])(\\d)", "\\1_\\2", df1$id), 
   header = FALSE, sep = "_", col.names = c("Country", "year")))

-output

        id Country year
1 AUS_1999     AUS 1999
2 CAN_1999     CAN 1999
3 AUS_2000     AUS 2000
4 CAN_2000     CAN 2000
5 BELS1999    BELS 1999
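
The approaches above rebuild both columns from id. If Country and year already exist and only the failed rows should be corrected, as described in the question, a minimal base R sketch could look like the following. The data frame df2 is a hypothetical reconstruction of the table in the question, and using is.na(year) to find the affected rows is an assumption about how the failed splits show up.

```r
# hypothetical data frame mirroring the question: row 5 was not split
df2 <- data.frame(
  id = c("AUS_1999", "CAN_1999", "AUS_2000", "CAN_2000", "BELS1999"),
  Country = c("AUS", "CAN", "AUS", "CAN", "BELS1999"),
  year = c(1999L, 1999L, 2000L, 2000L, NA),
  stringsAsFactors = FALSE
)

# assumption: a failed split is recognisable by a missing year
bad <- is.na(df2$year)

# take the leading letters as the country and the trailing digits as the year
df2$Country[bad] <- sub("^([A-Z]+)_?\\d+$", "\\1", df2$id[bad])
df2$year[bad] <- as.integer(sub("^[A-Z]+_?(\\d+)$", "\\1", df2$id[bad]))

print(df2)
```

Rows that were already split correctly are left untouched, so any manual corrections made to them are preserved.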

data

df1 <- structure(list(id = c("AUS_1999", "CAN_1999", "AUS_2000", "CAN_2000", 
"BELS1999")), row.names = c(NA, -5L), class = "data.frame")