Search code examples
rheaderdrop

Drop Columns columns that have been filled in


I'm loading a lot of data into R from CSVs. The CSVs contain columns with empty headers, and I would like to drop those columns from my data set. I've done some searching and I haven't found a good way to do this. R fills the columns with X2, X3, etc. I don't want to remove based on X, because they might drop some columns that I need. I have about 200 CVS that I need to load here.

Any help would be greatly appreciated.

Here's the text file of a sample csv I used to work through this. In my actual data, the blank columns don't always occur in every other cell.

Name,,Ages,,Color,,Year
Michael,some data,2,some other data,Blue,a third data,2001
Tiffany,some data,3,some other data,Red,a third data,2002
Bryan,some data,4,some other data,Green,a third data,2003
Sarah,some data,5,some other data,Orange,a third data,2004

Here's what I'd like to get back.

Name,Ages,Color,Year
Michael,2,Blue,2001
Tiffany,3,Red,2002
Bryan,4,Green,2003
Sarah,5,Orange,2004

Solution

  • We can create a regex to match column names that starts (^) with X followed by . or digits till the end ($) of the string in grep, specify invert = TRUE for negating

    i1 <- grep('^X([0-9.]+)?$', names(df1), invert = TRUE)
    df1[i1]
    

    -output

    #      Name Ages  Color Year
    #1 Michael    2   Blue 2001
    #2 Tiffany    3    Red 2002
    #3   Bryan    4  Green 2003
    #4   Sarah    5 Orange 2004
    

    Or using dplyr

    library(dplyr)
    df1 %>% 
          select(!matches('X([0-9.]+)?$'))
    

    If we have a lots of files to read from the working directory, read it in a list, and do the same

    files <- list.files(pattern = '\\.csv$', full.names = TRUE)
    lst1 <- lapply(files, function(x) {
                x1 <- read.csv(x)
               i1 <- grep('^X([0-9.]+)?$', names(x1), invert = TRUE)
               x1[i1]
          })
    

    data

    df1 <- structure(list(Name = c("Michael", "Tiffany", "Bryan", "Sarah"
    ), X = c("some data", "some data", "some data", "some data"), 
        Ages = 2:5, X.1 = c("some other data", "some other data", 
        "some other data", "some other data"), Color = c("Blue", 
        "Red", "Green", "Orange"), X.2 = c("a third data", "a third data", 
        "a third data", "a third data"), Year = 2001:2004),
        class = "data.frame", row.names = c(NA, 
    -4L))