I'm loading a lot of data into R from CSVs. The CSVs contain columns with empty headers, and I would like to drop those columns from my data set. I've done some searching and I haven't found a good way to do this. R fills the columns with X2, X3, etc. I don't want to remove based on X, because they might drop some columns that I need. I have about 200 CVS that I need to load here.
Any help would be greatly appreciated.
Here's the text file of a sample csv I used to work through this. In my actual data, the blank columns don't always occur in every other cell.
Name,,Ages,,Color,,Year
Michael,some data,2,some other data,Blue,a third data,2001
Tiffany,some data,3,some other data,Red,a third data,2002
Bryan,some data,4,some other data,Green,a third data,2003
Sarah,some data,5,some other data,Orange,a third data,2004
Here's what I'd like to get back.
Name,Ages,Color,Year
Michael,2,Blue,2001
Tiffany,3,Red,2002
Bryan,4,Green,2003
Sarah,5,Orange,2004
We can create a regex to match column names that starts (^
) with X
followed by .
or digits till the end ($
) of the string in grep
, specify invert = TRUE
for negating
i1 <- grep('^X([0-9.]+)?$', names(df1), invert = TRUE)
df1[i1]
-output
# Name Ages Color Year
#1 Michael 2 Blue 2001
#2 Tiffany 3 Red 2002
#3 Bryan 4 Green 2003
#4 Sarah 5 Orange 2004
Or using dplyr
library(dplyr)
df1 %>%
select(!matches('X([0-9.]+)?$'))
If we have a lots of files to read from the working directory, read it in a list
, and do the same
files <- list.files(pattern = '\\.csv$', full.names = TRUE)
lst1 <- lapply(files, function(x) {
x1 <- read.csv(x)
i1 <- grep('^X([0-9.]+)?$', names(x1), invert = TRUE)
x1[i1]
})
df1 <- structure(list(Name = c("Michael", "Tiffany", "Bryan", "Sarah"
), X = c("some data", "some data", "some data", "some data"),
Ages = 2:5, X.1 = c("some other data", "some other data",
"some other data", "some other data"), Color = c("Blue",
"Red", "Green", "Orange"), X.2 = c("a third data", "a third data",
"a third data", "a third data"), Year = 2001:2004),
class = "data.frame", row.names = c(NA,
-4L))