Search code examples
rimportformattingimport-from-excelreadxl

Import data with format


I have a question about import multiple data(excel or csv).
I know if I want to read multiple excel data(same column name) at the same time, the code is:

library(readxl)
file.list <- dir(path = "/path", pattern='\\.xlsx', full.names = T)
df.list <- lapply(file.list, read_excel)
data <- rbindlist(df.list)

So, I read them at the same time and combine to one data.
However, I take one of my data as an example.
In excel, row1 has a title name and row2 is header, so the obs. begins at row3.
Also, if the first data looks like this in EXCEL:

TITLE~~~~~
col1    col2    col3
A       1       3.59283E+14 
B       2       3.59258E+14
C       3       3.59286E+14
REFUND
A       -1      3.59286E+14

But, col3 is defined as numeric in excel. In reality, it looks like:

TITLE~~~~~
col1    col2    col3
A       1       359283060959987 
B       2       359258069826064
C       3       359286062903911
REFUND
A       -1      359283060959987

Row6 has REFUND. In my multiple data, I don't know which row has REFUND. I want to read my obs. without these rows. How can I do?
Actually, col3 is character. But in excel, it looks like numeric.
How can I define it to be character when import to R, so after importing, it doesn't show exponential sign.


Solution

  • I don't think there's a way to read your data directly without the REFUND rows, at least with read_excel function. However I'm quite new in R and I can be wrong.

    That said, the first thing that occurs to me is to build your own function. The one below seems to work.

    library(readxl)
    library(data.table)
    
    file.list <- dir(path = ".", pattern='\\.xlsx', full.names = T)
    
    my_read_data<-function(x){ #x list of files
    
      df.list<- lapply(x, function(x){read_excel(path=x,skip=1,col_names = TRUE,
                                                 col_types=c("text","numeric","text"))})
      #skip -> skip the line with the title
      #col_names -> use the first row as column names, i.e., col1, col2 and col3
      #col_types-> vector containing one entry per column indicating the type of data
    
      my.data <- rbindlist(df.list)
      my.data.clean<-my.data[my.data$col1!="REFUND",]  #select only rows without "REFUND"
    
      return(my.data.clean)
    }
    

    In order to run the function I have duplicate your excel example four times, changing the location of the REFUND row. The result I get is as follow.

    the.data<-my_read_data(file.list) 
    
    >the.data
       col1 col2            col3
    1:    A    1 359283060959987
    2:    B    2 359258069826064
    3:    C    3 359286062903911
    4:    A   -1 359283060959987
    5:    A    1 359283060959987
    6:    B    2 359258069826064
    7:    C    3 359286062903911
    8:    A   -1 359283060959987
    9:    A    1 359283060959987
    10:    B    2 359258069826064
    11:    C    3 359286062903911
    12:    A   -1 359283060959987
    13:    A    1 359283060959987
    14:    B    2 359258069826064
    15:    C    3 359286062903911
    16:    A   -1 359283060959987
    

    EDIT - Function to pass the column to be changed to character type

    Regarding your comment, maybe you'd want to can consider this function instead :

    my_read_data2<-function(x,character_col=NULL){ #x->list of files 
                                                   # character_col->column to be change to character
                                                  # can be more than one
    
      df.list<- lapply(x, function(x){read_excel(path=x,skip=1,col_names = TRUE)})
      my.data <- rbindlist(df.list)
      my.data.clean<-my.data[my.data$col1!="REFUND",]  #select only rows without "REFUND"
    
      # changing column selected by character_col to character
      # since the result from step above is a data table, 
      # access to elements is different from data frame
    
      if(!is.null(character_col)){ #this allow you to use the function using only
                                   # default results from read_excel
    
          my.data.clean[, eval(character_col):= lapply(.SD, as.character),
                        .SDcols= character_col]
      }
      # eval -> you need to evaluate the argument you pass to the function,
      #         otherwise you'll end up with an additional character_col column
      #         that will be a list of all the columns you include in .SDcols
      #.SD -> is the subset of the data table, in this case 
      #       .SDcols specifies the columns that are included in .SD.
    
      return(my.data.clean[]) # in that case, don't forget the [] to avoid
                              #the odd behaviour when calling your resulting data table
                              #(see link at the end)                            
     }
    

    Examples:

        the.data<-my_read_data2(file.list)
    str(the.data)
    
     >str(the.data)
        Classes ‘data.table’ and 'data.frame':  16 obs. of  3 variables:
        $ col1: chr  "A" "B" "C" "A" ...
        $ col2: num  1 2 3 -1 1 2 3 -1 1 2 ...
        $ col3: num  3.59e+14 3.59e+14 3.59e+14 3.59e+14 3.59e+14 ...
        - attr(*, ".internal.selfref")=<externalptr> 
    
    the.data1<-my_read_data2(file.list,"col3")
    str(the.data1) 
    
    > str(the.data1)
        Classes ‘data.table’ and 'data.frame':  16 obs. of  3 variables:
        $ col1: chr  "A" "B" "C" "A" ...
        $ col2: num  1 2 3 -1 1 2 3 -1 1 2 ...
        $ col3: chr  "359283060959987" "359258069826064" "359286062903911" "359283060959987" ...
        - attr(*, ".internal.selfref")=<externalptr> 
    

    you could also use more than one column:

    the.data2<-my_read_data2(file.list,c("col2","col3"))
    the.data3<-my_read_data2(file.list,c(2,3))      
    

    data.table objects not printed after returned from function

    Hope it helps you