I have dozens of very heavy Excel files that I need to import into R (then rebind). Each file has 2 sheets, where the second sheet (name: "Results") consists of 100K rows at least and has about 350 columns.
I would like to read a subset of the sheet "Results" from each file by columns, but most importantly, by specific rows. Each "ID" in the data, has a main row and then multiple rows below which contain data in specific columns. I would like to read the main row only (this leaves each file with 50-400 rows (depending on the file) and 150 variables). The first column that numbers main rows does not have a header.
This is what the data looks like (simplified):
I would like to import only the rows whose first column isn't empty but numbered (i.e., 1., 13., 34., 211.) and particular columns, in this example columns 2,3,5 (i.e., name, ID, status). The desired output would be:
Is there a simple way to do this?
Let's say a
is our excel file, as data frame.
library(readxl)
a <- as.data.frame(read_excel("Pattern/File.xlsx",sheet = "Results"))
For instance, we want to select columns 1 to 3, so use
subset(a[,1:3],is.na(a[1])==FALSE)
By this function, you are subsetting the input data frame with values different than NA in first column.
Output:
...1 name ID
1 1 Dan us1d
4 13 Nev sa2e
6 34 Sam il5a
Note first column name (" ...1 "). This is autogenerated by read_excel()
function, but should not be a problem.