We collect bat calls and analyze them, and the outputs are a nightmare. I would like to streamline our process, but I'm stumped. How do I take dozens of files from Excel that look like this:
And get the import to add the rows, so that every set of four rows spaced by two rows (i.e. rows 3-6, 9-12, 15-18, etc.; it's the same spacing for every project) is imported until empty space is reached (it'll be a different number of repetitions for each project)? I also want endCol to stop at empty space for each four-row segment...
I can specify the range for a given site easily, using:
library(XLConnect)
df <- t(readWorksheetFromFile("file directory", sheet = 2,
                              header = FALSE, startCol = 2, startRow = 3,
                              endCol = 5, endRow = 6))
Then it gets pretty ugly, but I've done this:
colnames(df) <- c("Species", "n", "percent", "mle")
df <- transform(df, Species = as.character(Species),
                n = as.numeric(as.character(n)),
                percent = as.numeric(as.character(percent)),
                mle = as.numeric(as.character(mle)))
The output is formatted the way I want, but I need to fix the endRow and endCol, and I don't know how... any suggestions would be hugely appreciated.
I would tackle this much like @Frank mentioned in the comments. I'd read the whole thing in as one big file and then split it up into a list based on the file path information. Once these are split into a list, each dataset can be cleaned up in an lapply loop.
I'm reading it in via readxl::read_excel, but you can likely read the whole file in via a function from XLConnect if you'd prefer.
library(readxl)
orig = read_excel("test.xlsx", col_names = FALSE)
The first six lines of some fake data look like:
X0 X1 X2 X3 X4 X5 X6
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 c:\\file directory\\acoustic data\\Site 10\\20160517 <NA> <NA> <NA> <NA> <NA> <NA>
2 identification summary <NA> <NA> <NA> <NA> <NA> <NA>
3 ID EPFU LANO <NA> MID <NA> <NA>
4 N 70 12 <NA> 4 <NA> <NA>
5 % 16 3 <NA> 13 <NA> <NA>
6 MLE(p) 1E-3 2E-3 <NA> <NA> <NA> <NA>
This original file should be split into separate tables, which can be done based on the rows that contain the file path information starting with "c:". To figure out where these are, use cumsum with grepl on the first column of the whole dataset.
groups = cumsum(grepl("c:", orig$X0))
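For the six fake rows shown above, only the first row matches "c:", so each of those rows gets group number 1; every additional path row further down the sheet bumps the counter by one, marking the start of the next table:
head(groups)
# [1] 1 1 1 1 1 1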
Use this vector to split the file, saving each separate table into a list via split.
orig_list = split(orig, groups)
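Each list element holds the raw rows of one site/date block and is named by its group number, so you can sanity-check the split before cleaning (purely an illustrative check; not required):
length(orig_list)    # how many site/date tables were found
orig_list[["1"]]     # the raw, untransposed rows of the first table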
Now the rest of the work is cleaning up each dataset, transposing things, and removing any extra rows and columns. This is also where you might pull out the Site and Date information from the file path to add to the dataset to keep things organized, which I demonstrate but which isn't strictly necessary. I put this all into a function for use in lapply. Note I used readr::type_convert at one point to make sure numeric variables were converted appropriately.
clean_data = function(data) {
    # Get rid of blank separator rows (missing values in the first column)
    new = data[!is.na(data[[1]]), ]
    # Transpose
    new = t(new)
    # Put into data.frame, removing extraneous columns 1 to 2
    # and using first row as variable names
    # (stringsAsFactors = FALSE keeps the columns as character so
    # readr::type_convert can convert them to appropriate types)
    new2 = as.data.frame(new[-1, -(1:2)], stringsAsFactors = FALSE)
    new2 = readr::type_convert(new2)
    names(new2) = new[1, -(1:2)]
    # Add site and date columns by pulling info from original "c:\\..." cell
    new2$site = unlist(strsplit(new[1], "\\\\"))[4]
    new2$date = unlist(strsplit(new[1], "\\\\"))[5]
    # Remove everything after first missing ID
    new2[cumsum(is.na(new2$ID)) == 0, ]
}
Now loop through all chunks and clean each one. The resulting cleaned files will be in a list. You could row bind these together into one dataset if needed.
lapply(orig_list, clean_data)
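If you do want a single combined dataset, bind the cleaned list together. A minimal sketch, assuming every cleaned table ends up with the same column names (ID, N, %, MLE(p), site, date); the object names cleaned_list and all_sites are just for illustration:
cleaned_list = lapply(orig_list, clean_data)
all_sites = do.call(rbind, cleaned_list)
# or, with dplyr installed, which also copes with tables whose columns differ:
# all_sites = dplyr::bind_rows(cleaned_list)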