Search code examples
rtext-processingstring-parsing

R Cleaning and reordering names/serial numbers in data frame


Let's say I have a data frame as follows in R:

 Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
 Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
 Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
 Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
 Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
 Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")

What I want to do is the following:

  1. Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
  2. Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
  3. Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
  4. Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).

My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!


Solution

  • Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.

     seinfeld = lapply(1:nrow(Data), function(i) {
    
       # Turn strings into data frame with one name per row
       dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")), 
                  Year=Data[i,"Year"],
                  Name=unlist(strsplit(Data[i,"Name"], split="\n")))
    
       # Get rid of empty strings and NA values
       dat = dat[!(dat$Name %in% c(""," ","NA")), ]
    
       # Order alphabetically
       dat = dat[order(dat$Name), ]
     })
    

    UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:

    seinfeld = lapply(1:nrow(Data), function(i) {
    
      # Turn strings into data frame with one name per row
      dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")), 
                       Name=unlist(strsplit(Data[i,"Name"], split="\n")))
    
      # Get rid of empty strings and NA values
      dat = dat[!(dat$Name %in% c(""," ","NA")), ]
    
      # Order alphabetically
      dat = dat[order(dat$Name), ]
    
      # Collapse back into a single row with the new sort order
      dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
                       Year=Data[i, "Year"],
                       Name=paste(dat[, "Name"], collapse="\n"))
    
    })
    
    do.call(rbind, seinfeld)
    
               SerialNum Year                          Name
    1      837\n983\n424 2015          Lewis\nMichael\nPaul
    2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
    3 321\n987\n654\n975 2010     George\nJohn\nPaul\nRingo
    4      837\n983\n424 2015          Lewis\nMichael\nPaul
    5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer