Search code examples
rread.csvreadxl

In R is there a way to read files and check the first column of unique IDs against a predetermined list of IDs and return only those files or names?


I have a folder of a few thousand files (Both .csv and .xls) and in each of these files the first column is made up of unique ID numbers. The other fields in these files are different pieces of data that I'll need to extract with respect to that unique ID number. The catch is that I have a list of predetermined ID numbers that I need to pull the data for. Some files may or may not have 1 or more of my predetermined list of IDs in them. How do I check the first column in these files against my predetermined list of IDs and return the filenames of the files that contain 1 or more of my predetermined list of IDs?


Solution

  • The following should work:

    
    library(xlsx)
    library(readxl) # for read_xls
    my_path="C:/Users/Desktop/my_files"
    
    # Collect the names of the files
    list_doc_csv=list.files(path = my_path, pattern = ".csv", all.files = TRUE)
    list_doc_xlsx=list.files(path = my_path, pattern = ".xlsx", all.files = TRUE)
    list_doc_xls=list.files(path = my_path, pattern = ".xls", all.files = TRUE)
    # Step needed as .xlsx files were select as having ".xls" patterns
    list_doc_xls=list_doc_xls[which(!list_doc_xls%in%list_doc_xlsx)] 
    
    # Declare ID of interest
    ID_interesting=c("id1","id33","id101")
    
    
    list_interesting_doc=c()
    # Loop on CSV files and check the content of first column
    for (doc in list_doc_csv){
      column1=read.csv(file=paste0(my_path,"/",doc))[,1]
      if(sum(column1%in%ID_interesting)>0){
        list_interesting_doc=c(list_interesting_doc,doc)
      }
    }
    # Loop on .xlsx files
    for (doc in list_doc_xlsx){ 
      column1=read.xlsx(file=paste0(my_path,"/",doc),sheetIndex = 1)[,1]
      if(sum(column1%in%ID_interesting)>0){
        list_interesting_doc=c(list_interesting_doc,doc)
      }
    }
    # Loop on .xls files
    for (doc in list_doc_xls){ 
      column1=unlist(read_xls(path=paste0(my_path, "/", doc))[,1])
      if(sum(column1%in%ID_interesting)>0){
        list_interesting_doc=c(list_interesting_doc,doc)
      }
    }
    
    print(list_interesting_doc)