Search code examples
rcsv-import

Using R to Split a CSV with multiple sets of data separated by headers and creating an identifying column


pretty new to R and i've come across a weird CSV my boss needs me to simplify for use.

I have a .csv that has multiple sets of data. Usually fine but each set of data is separated with a empty line and then the headers for the next set of data (the headers are always the same). Only problem is that there is no identifying column to let me just remove these rows. So for each set of data I need to remove the empty Row and Headers whilst giving each set a unique value in a new column. I hope this makes sense.

Honestly I've not got a clue handle this situation and I can't find any questions I could adapt. At least not at my current knowledge.

Any help will be greatly appreciated and might help me convince my boss to drop his mandraulic ways...

Example Image


Solution

  • Edited because I missed the desired "set" column..

    If your data is as you say it is from the picture you might be able to just... It worked for me on a small sample I built in excel made up to be a bit like yours. I'm assuming "Current" and "Desired" are not actually in the file and the first row starts with the header

    library(dplyr)
    
    yourdata <- read.csv("yourcsv.csv")
    
    yourdata %>% 
      mutate(Set = cumsum(!duplicated(Location == "Location"))) %>% 
      filter(Location != "Location", Location != "") 
    #>   Location Feature Height Volume Set
    #> 1     U104       3    104     99   1
    #> 2     U104       5    123    456   1
    #> 3     U104       6     45    999   2
    #> 4     U105       6     45    999   2
    

    Based on data

    structure(list(Location = c("U104", "U104", "", "Location", "U104", 
    "U105"), Feature = c("3", "5", "", "Feature", "6", "6"), Height = c("104", 
    "123", "", "Height", "45", "45"), Volume = c("99", "456", "", 
    "Volume", "999", "999")), class = "data.frame", row.names = c(NA, 
    -6L))