Search code examples
rdataframeformattingdata-cleaningdata-import

Correct offset rows after importing data in R


Problem: For some reason, my software uses spaces to separate data, but also used spaces in words or phrases that should not count as separate cells, therefore making my data start at irregular columns. It seems simple enough, I'm sure someone else posted about this particular problem, but I can't find the proper terms to articulate my problem clearly enough to find the other posts.

Example data:

bad <- data.frame(c("Block","NA","NA","Block","NA","NA"),
                  c("1:","image2","image3","2:","image5","image6"),
                  c("image1","NA","NA","image4","NA","NA"))

Current output:

names(bad) <- NULL
print(bad)
1 Block     1: image1
2    NA image2     NA
3    NA image3     NA
4 Block     2: image4
5    NA image5     NA
6    NA image6     NA

Desired output:

1 Block 1: image1
2       NA image2
3       NA image3
4 Block 2: image4
5       NA image5
6       NA image6
7 #From 3 to 2 columns

Question: What's the most efficient way to accomplish this?

What I've tried/thought about: 1) The cleanme function from data science (but it just retains the lines of data where there's a "Block" string and eliminate the other lines); 2) Not sure how to go about this one but basically using the gsub function to replace every value of column 2 that contains [1:5] followed by ":" with "Block [1:5]:", and then moving that whole row to the left (but the problem is that I also have rows with dates that I don't want to collapse that may start in the exact same way); 3) Even if I use the gsub function to replace one string by another, I still have to collapse the columns, which I could do with the paste function, but again, I only want to collapse or replace the first two columns of every row that start with a "Block" string—and I'm not sure of the syntax specifications to combine all this, or whether I'm really just complicating my life for nothing.

Note: I've done tutorials, etc., but I can't find a way to specifically do this. Please refer me to the right post/duplicate and delete this one if it already exists. Thanks!

Update

Slightly adapting earch's answer below to my actual data structure, I was able to find a working solution (my data set is more complicated than my example). For reference:

# Continuing from example above
bad <- as.matrix(bad) # Note that I didn't need this step for my actual data but needed here

good <- lapply(1:nrow(bad), function(i) bad[i, !is.na(bad[i, ])]) # Transforms rows into lists

good <- lapply(good, function(x) {
  if (x[1] == "Block") { # If the row starts with the word "Block", then do the following:
    c(paste(x[1:2], collapse = " "), x[3:length(x)]) # Paste the first two cells collapsed together (so Block + the block number belonging to the next cell) while adding the remaining row cells
  } else {
    c(x) # Just put the row in a list (didn't worked without this step)
  }
})

good <- do.call(rbind, good) # Binds elements from list together
good <- as.data.frame(good) # Puts everything nicely in a neat dataframe
good

        V1     V2       V3
1 Block 1: image1 Block 1:
2       NA image2       NA
3       NA image3       NA
4 Block 2: image4 Block 2:
5       NA image5       NA
6       NA image6       NA

As you can see, there still is a slight problem with my modification using this sample data set, which is that "Block 2:" repeats itself next, but that's not too much of a problem and at least the actual data is aligned. In my actual data, there were many more columns, so that didn't occur, and using this solution, the third and next columns actually contained reaction times and other information.


Solution

  • Howabout the following? I wasn't sure if you wanted "NA" to be actual NA or the string, but you can modify the code below to be either. I also wasn't sure if the desired effect was that the new data.frame have two columns or if there was some situation where it should be more (or less). I assumed the former.

    > bad <- data.frame(
    +   c("Block","NA","NA","Block","NA","NA"),
    +   c("1:","image2","image3","2:","image5","image6"),
    +   c("image1","NA","NA","image4","NA","NA")
    + )
    > names(bad) <- NULL
    > bad
    
    1 Block     1: image1
    2    NA image2     NA
    3    NA image3     NA
    4 Block     2: image4
    5    NA image5     NA
    6    NA image6     NA
    > 
    > bad <- as.matrix(bad)
    > bad[bad == "NA"] <- NA
    > 
    > good <- lapply(1:nrow(bad), function(i) bad[i, !is.na(bad[i, ])])
    > good <- lapply(good, function(x) {
    +   if(length(x) == 1) {
    +     c(NA, x)
    +   } else {
    +     c(paste(x[1:(length(x) - 1)], collapse = " "), x[length(x)])
    +   }
    + })
    > good <- do.call(rbind, good)
    > good <- as.data.frame(good)
    > good
            V1     V2
    1 Block 1: image1
    2     <NA> image2
    3     <NA> image3
    4 Block 2: image4
    5     <NA> image5
    6     <NA> image6