r, csv, batch-processing

Importing CSV files into R and removing note rows at both the beginning and the middle


I have several CSV files recorded by air sensors (TSI Bluesky and AirAssure). The device records the data to its SD card. As with many machine-recorded files, the first 59 lines are notes that start with # and record basic information such as serial numbers. These notes are easy to skip by adding skip=59. However, the notes can also appear in the middle of a CSV file, interrupting the record, and when they do, the column names and units are repeated after them. An example is below.

#note
#note
#note
#note
col1 col2 col3
unit1 unit2 unit3
1 2 3
1 2 3
1 2 3
#note
#note
#note
#note
col1 col2 col3
unit1 unit2 unit3
1 2 3
1 2 3
1 2 3

How can I skip all the notes and unit rows, keeping only one row of column names and all the numbers?


Solution

  • This code reads the data from a text string. If you are loading the CSV file from a folder, check whether the separator is "\t" or " " and set sep to match.

    The comment.char parameter filters out the note lines that start with #.

    text <- 
    "
    #note       
    #note       
    #note       
    #note       
    col1    col2    col3
    unit1   unit2   unit3
    1   2   3
    1   2   3
    1   2   3
    #note       
    #note       
    #note       
    #note       
    col1    col2    col3
    unit1   unit2   unit3
    1   2   3
    1   2   3
    1   2   3
    "
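    library(dplyr)

    # sep = "" splits on any run of whitespace (tabs or spaces);
    # comment.char = "#" drops the note lines
    df <- read.csv(text = text, comment.char = "#", sep = "")

    # drop the repeated header and unit rows, then restore numeric column types
    df <- filter(df, !col1 %in% c("col1", "unit1"))
    df <- type.convert(df, as.is = TRUE)
    df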
    

    Output:

      col1 col2 col3
    1    1    2    3
    2    1    2    3
    3    1    2    3
    4    1    2    3
    5    1    2    3
    6    1    2    3
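
The same idea can be applied to files on disk, and to several files at once. The following is only a sketch in base R, not part of the answer above: read_sensor_csv is a hypothetical helper name, the comma separator is an assumption about the files, and it assumes the first non-note line is the header, the second is the unit row, and both repeat verbatim after each block of notes. Because the note lines are filtered by their leading #, skip=59 is not needed.

```r
# Hypothetical helper (not from the original answer): read one sensor file,
# dropping "#" notes, repeated header rows, and unit rows.
read_sensor_csv <- function(path, sep = ",") {
  lines <- readLines(path, warn = FALSE)
  key <- trimws(lines)

  # Keep only lines that are neither blank nor notes starting with "#".
  keep <- nzchar(key) & !startsWith(key, "#")
  lines <- lines[keep]
  key <- key[keep]

  # Assumption: the first remaining line is the header and the second is the
  # unit row, and both repeat verbatim after every block of notes.
  header <- key[1]
  units <- key[2]
  body <- lines[!key %in% c(header, units)]

  read.csv(text = c(header, body), sep = sep)
}

# Demo on a temporary comma-separated file shaped like the example above.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "#note", "#note",
  "col1,col2,col3",
  "unit1,unit2,unit3",
  "1,2,3", "1,2,3",
  "#note",
  "col1,col2,col3",
  "unit1,unit2,unit3",
  "1,2,3"
), tmp)

df <- read_sensor_csv(tmp)
str(df)

# Batch use over several files (here just the same temporary file twice).
files <- c(tmp, tmp)
all_data <- do.call(rbind, lapply(files, read_sensor_csv))
```

For a real folder, files could instead come from list.files() with full.names = TRUE. Because the header and unit rows are removed before parsing, read.csv can infer numeric column types directly.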