Search code examples
rrbindlist

How to define the first column before import and bind ( rbindlist ) in R


Goal

import and bind a list of xlsx file with unpredicted irrelevant first column. (Have to eliminate it but you don't know which file contains irrelevant first column.)

# sample : remind that the xlsx file originally don't have any column names

    DT1 = data.table(a = c("TYPE","X","Y","Z"), b = c("MONTH","Jan","Feb","Mar"))
    DT2 = data.table(a = c(NA,NA,NA,"random_irrelevant_vale") ,b = c("TYPE","X","Y","W"), c = c("MONTH","Apr","Feb","May"))

# give "idcol" & col_names = FALSE is a must
    PATH_LIST<- list.files(path = "PATH",
                               pattern = "*.xlsx",full.names = TRUE)
    names(PATH_LIST) <- basename(PATH_LIST)

    rbindlist(lapply(PATH_LIST,import,col_names = FALSE,
                             col_types = "text"),
                      idcol = "source",fill = TRUE) 

    # EXPECTED OUTPUT
    TYPE MONTH
    X    Jan
    Y    Feb
    Z    Mar
    X    Apr
    Y    Feb
    W    May
  • Cannot simply use !grepl("random_irrevelevant_value",x) as "random_irrevelevant_value" is a random value .

How can i achieve this between the step of import and rbind them ?


Solution

  • Assuming that the relevant columns are at the back and there are some files with no irrelevant columns, you can try something like:

    DTls <- lapply(list.files(pattern="DT(.*).csv"), fread, header=FALSE)
    m <- min(lengths(DTls))
    rbindlist(
        lapply(DTls, function(DT) {
            cols <- head(names(DT), length(DT)-m)
            if (length(cols) > 0)
                DT[, (cols) := NULL]
            DT
        })
    )
    

    output:

         V1    V2
    1: TYPE MONTH
    2:    X   Jan
    3:    Y   Feb
    4:    Z   Mar
    5: TYPE MONTH
    6:    X   Apr
    7:    Y   Feb
    8:    W   May
    

    data files:

    library(data.table)
    DT1 = data.table(a = c("TYPE","X","Y","Z"), b = c("MONTH","Jan","Feb","Mar"))
    DT2 = data.table(a = c(NA,NA,NA,"random_irrelevant_vale") ,b = c("TYPE","X","Y","W"), c = c("MONTH","Apr","Feb","May"))
    fwrite(DT1, "DT1.csv", col.names=FALSE)
    fwrite(DT2, "DT2.csv", col.names=FALSE)