
Finding out which columns are the same across different files


I have several data frames in R in the global environment:

file_1 <- data.frame(A = 1:5, B = 6:10, C = 11:15)
file_2 <- data.frame(A = 1:5, D = 16:20, E = 21:25)
file_3 <- data.frame(B = 6:10, C = 11:15, F = 26:30)

I want to build a matrix that shows which column names appear in which data frames, so I can see which ones are shared and which are not.

I tried to do this manually:

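# Names of the data frames in the global environment
files <- c("file_1", "file_2", "file_3")
column_names <- list()

for (file in files) {
  data <- get(file)  # fetch the data frame by its name
  column_names[[file]] <- colnames(data)
}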

all_columns <- unique(unlist(column_names))
matrix <- sapply(column_names, function(cols) all_columns %in% cols)
rownames(matrix) <- all_columns

matrix_df <- as.data.frame(matrix)

print(matrix_df)

Is this the correct way to do this in R?

BTW, if they were in a list, I think we could do it like this:

all_columns <- unique(unlist(lapply(mylist, colnames)))

matrix <- sapply(mylist, function(df) all_columns %in% colnames(df))
rownames(matrix) <- all_columns

matrix_df <- as.data.frame(matrix)

print(matrix_df)
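
If the data frames are not in a list yet, one possible way to get there is `mget()`, which looks up several objects by name and returns them as a named list. The sketch below assumes the three example data frames from above; the name `presence` is introduced here only for illustration, partly to avoid shadowing the base function `matrix()`. It also swaps `sapply()` for `vapply()`, which is a stylistic choice rather than a requirement.

```r
# Example data frames from the question
file_1 <- data.frame(A = 1:5, B = 6:10, C = 11:15)
file_2 <- data.frame(A = 1:5, D = 16:20, E = 21:25)
file_3 <- data.frame(B = 6:10, C = 11:15, F = 26:30)

# Collect the data frames into a named list; mget() looks the names up
# in the environment it is called from
mylist <- mget(c("file_1", "file_2", "file_3"))

# Every distinct column name across all data frames
all_columns <- unique(unlist(lapply(mylist, colnames)))

# One logical column per data frame; vapply() checks that each result
# is a logical vector with one entry per column name
presence <- vapply(
  mylist,
  function(df) all_columns %in% colnames(df),
  logical(length(all_columns))
)
rownames(presence) <- all_columns

print(as.data.frame(presence))
```

The result has one row per column name and one column per data frame, so `presence["A", "file_3"]` answers whether `file_3` has a column `A`.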

Solution

  • Do you mean a matrix like the one below?

    > table(stack(lapply(mget(ls(pattern = "file_")), names)))
          ind
    values file_1 file_2 file_3
         A      1      1      0
         B      1      0      1
         C      1      0      1
         D      0      1      0
         E      0      1      0
         F      0      0      1
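
The one-liner above packs several steps together. As a rough sketch, it can be unpacked as follows; the intermediate names (`dfs`, `col_names`, `long`, `counts`, `in_all`, `shared`) are made up for illustration, and the stricter pattern `"^file_[0-9]+$"` is an assumption to avoid picking up unrelated objects whose names merely contain `file_`. The last two lines are an addition that summarises the table by row.

```r
# Example data frames from the question
file_1 <- data.frame(A = 1:5, B = 6:10, C = 11:15)
file_2 <- data.frame(A = 1:5, D = 16:20, E = 21:25)
file_3 <- data.frame(B = 6:10, C = 11:15, F = 26:30)

# 1. Find the objects by name and fetch them as a named list
dfs <- mget(ls(pattern = "^file_[0-9]+$"))

# 2. Extract the column names of each data frame
col_names <- lapply(dfs, names)

# 3. Stack into long form: `values` holds the column name,
#    `ind` holds the data frame it came from
long <- stack(col_names)

# 4. Cross-tabulate column name against data frame
counts <- table(long)
print(counts)

# Column names found in every data frame
in_all <- rownames(counts)[rowSums(counts > 0) == ncol(counts)]

# Column names found in more than one data frame
shared <- rownames(counts)[rowSums(counts) > 1]
```

With the example data, `in_all` is empty because no column appears in all three data frames, while `shared` contains `A`, `B` and `C`, each of which appears in two.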