Search code examples
rdataframedatatablematchmultiple-columns

How to replace column names in a .csv file by matching them with values from another file


I have two files that I am working with. Both have thousands of values, so some of the simpler/more traditional formatting methods with vector lists etc. don't work. The biggest issues I am running into are that the values I am working with are different lengths and that the closures are not subsettable. I am reluctant to pad the smaller file so that they are the same length, so I have been using setnames rather than match or another function.

The smaller file is the main one I am manipulating and the one that needs the column names to be replaced. It has over 6000 columns and is set up like so:

x           A1     A2     A3     A4     A13     A14
sample1     928    29     0      298    8392    138
sample2     0      239    903    13     424     2
sample3     348    930    1938   23     233     492
sample7     843    349    90     239    0       239
sample8     234    349    30     39     8249    845
sample19    849    0      1235   14     149     982

Etc. Neither the rows nor columns are sequential, as some have been removed. (This is why the files are no longer the same size/length.)

The second file is made up of only two columns, and matches the names with string values. It has over 10,000 rows and is set up like so:

a    x
A1  TIAHKGS
A2  ASJKHFSLA
A3  ASKJLHFAS
A4  JSHDKGFK
A5  ASHJFSKLHF
A6  ASLHFDASF
A7  ASHGFKJ

Etc. The string values are not all the same length, but the column is sequential. I am trying to replace the column names of the first file with the matching values from the second column of the second file.

The two methods that were closest to solving the problem were the match and setnames function. I was worried that match wouldn't be able to properly handle skipping values, so my latest attempt used setnames. This seemed? to fix the problem with different length files, but now I am receiving Error in data.frame$FileKey : #object of type 'closure' is not subsettable

Also, I am using vroom instead of readr because it is faster with large files.

library(vroom)
library(data.table)

MainFile <- vroom("File1.csv")
FileKey <- vroom("File2.csv")

old <- colnames(MainFile)
new <- FileKey[,c(2)]

setnames(MainFile, data.frame$old, data.frame$new, skip_absent=TRUE)
Error in data.frame$FileKey : 
#object of type 'closure' is not subsettable

names(MainFile)

Any advice is appreciated. I'm not sure if there is a way to melt or bind the files together that avoids both problems?

Thank you!


Solution

  • You can use match provided you handle the NA values correctly. Here is a base R version using match.

    vals <- FileKey$x[match(names(MainFile), FileKey$a)]
    names(MainFile)[!is.na(vals)] <- na.omit(vals)
    MainFile
    
             x TIAHKGS ASJKHFSLA ASKJLHFAS JSHDKGFK  A13 A14
    1  sample1     928        29         0      298 8392 138
    2  sample2       0       239       903       13  424   2
    3  sample3     348       930      1938       23  233 492
    4  sample7     843       349        90      239    0 239
    5  sample8     234       349        30       39 8249 845
    6 sample19     849         0      1235       14  149 982
    

    data

    MainFile <- structure(list(x = c("sample1", "sample2", "sample3", "sample7", 
    "sample8", "sample19"), A1 = c(928, 0, 348, 843, 234, 849), A2 = c(29, 
    239, 930, 349, 349, 0), A3 = c(0, 903, 1938, 90, 30, 1235), A4 = c(298, 
    13, 23, 239, 39, 14), A13 = c(8392, 424, 233, 0, 8249, 149), 
        A14 = c(138, 2, 492, 239, 845, 982)), class = "data.frame", row.names = c(NA, 
    -6L))
    
    FileKey <- structure(list(a = c("A1", "A2", "A3", "A4", "A5", "A6", "A7"
    ), x = c("TIAHKGS", "ASJKHFSLA", "ASKJLHFAS", "JSHDKGFK", "ASHJFSKLHF", 
    "ASLHFDASF", "ASHGFKJ")), class = "data.frame", row.names = c(NA,-7L))