Search code examples
rregexdataframestrsplit

String split rows of dataframe based on pattern into new columns


I have a data frame which looks like this

df

Country     Year  col3. col4. col5
USA2018      10    50    13   NA
UK 2018       4    12     6   NA
China       2018   15     4    1
Malta        NA    2018  25    8

I would like to split the strings of "Country" column by the pattern "2018" for the rows where 2018 is merged into the first column and shifted to the Year column for the rows where the Year is NA and have this output:

df

Country     Year  col3. col4. col5
USA         2018   10    50    13   
UK          2018   4    12     6   
China       2018   15    4    1
Malta       2018   25    8     NA

Any suggestion?


Edit: This data is result of PDF scraping.Link to PDF, and the code below:

# install.packages("pdftools") 
# install.packages("readr") 
library(pdftools) 
library(readr) 


epi <- pdf_text("malaria_epi.pdf") 
epi_df <- epi %>% 
  read_lines() %>% 
  grep('^\\s{2}\\w', ., value = TRUE) %>% 
  paste(collapse = '\n') %>% read_fwf(fwf_empty(.)) 

Solution

  • Here is a solution. It was a bit tricky, but I think it cover your case. This solution could be problematic if you have NA in the middle of a row, but I didn't manage to find a better way yet.

    df <- read.table(header=TRUE, 
                     text="
    Country     Year  col3. col4. col5
    USA2018      10    50    13   NA
    UK2018       4    12     6   NA
    China       2018   15     4    1
    Malta        NA    2018  25    8")
    
    tmpN <- names(df) # save the colnames
    df = cbind(df[,1],df) # duplicate the first column
    df[,c(1,2)] <- lapply(df[,c(1,2)], as.character)
    
    df[,1] = sub('[[:digit:]]+','',df[,1]) # remove date in first column
    df[,2] = sub('[[:alpha:]]+','',df[,2]) # remove city in second column
    df[df==''] <- NA # replace empty cells with NA
    
    # push all NA to the right side
    df2 = as.data.frame(t(apply(df,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
    df2 <- df2[,!(colSums(is.na(df2))==nrow(df2))] # remove column full of NA
    
    colnames(df2) <- tmpN # replace colnames