Search code examples
rdataframecomparisoncolumnsorting

Comparing certain column names in two dataframes


I am looking for a way to compare two different data frame column names (only certain ranges of columns) and keeping only the names that have the same shared similarity of column names between the two data frames. (I only want to compare each data frame from a certain point forward only though.

I would have a significantly larger number of columns, so i want to see what would work best in larger data sets.

Say i have DF1

ID Location Value2 Value3 Value4
First Park 4 3 3
Second House 2 5 2
Second Barn 4 5 6
Third Lake 1 8 8
Third Airport 7 5 4
Fourth Car 4 5 1

And DF2

Name Identifier City Value2 Value3
Jeff ISE2 Seattle 1 6
George GSY7 Houston 2 2
Carl BHU1 Miami 3 7
Mike POI0 Los Angeles 8 8
Linel ANN1 DC 5 4
Pork CNU4 Portland 5 1

In DF1, we will keep ID and Location column regardless, and for DF2, we keep Name, Identifier, and City regardless. My goal is to compare the other numerical columns to see if there is a match in the name. So in this case, the match is "Value2" and "Value3."

So the processed data frames would be:

New DF1

ID Location Value2 Value3
First Park 4 3
Second House 2 5
Second Barn 4 5
Third Lake 1 8
Third Airport 7 5
Fourth Car 4 5

And new DF2

Name Identifier City Value2 Value3
Jeff ISE2 Seattle 1 6
George GSY7 Houston 2 2
Carl BHU1 Miami 3 7
Mike POI0 Los Angeles 8 8
Linel ANN1 DC 5 4
Pork CNU4 Portland 5 1

Is one way comparing the column names of each DF and compare it to the other? I would appreciate any help on this. Thank you!

ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park","House","Barn","Lake","Airport","Car")
Value2 <- c(4,2,4,1,7,4)
Value3 <- c(3,5,5,8,5,5)
Value4 <- c(3,2,6,8,4,1)
DF1 <- data.frame(ID, Location, Value2, Value3, Value4)
Name <- c("Jeff", "George", "Carl", "Mike", "Linel", "Pork")
Identifier <- c("ISE2","GSY2","BHU1","POI0","ANN1","CNU4")
City <- c("Seattle","Houston","Miami","Los Angeles","DC","Portland")
Value2 <- c(1,2,3,8,5,5)
Value3 <- c(6,2,7,8,4,1)
DF2 <- data.frame(Name, Identifier, City, Value2, Value3)

Solution

  • Here is a base R option

    # Common column-name pattern across `data.frame`
    pattern <- "Value"
    common_nms <- intersect(
        names(DF1)[grep(pattern, names(DF1))], 
        names(DF2)[grep(pattern, names(DF2))])
    
    # Unique and and "pattern-shared" columns in `DF1`
    DF1[c(names(DF1)[-grep(pattern, names(DF1))], common_nms)]
    #      ID Location Value2 Value3
    #1  First     Park      4      3
    #2 Second    House      2      5
    #3 Second     Barn      4      5
    #4  Third     Lake      1      8
    #5  Third  Airport      7      5
    #6 Fourth      Car      4      5
    
    # Unique and and "pattern-shared" columns in `DF2`
    DF2[c(names(DF2)[-grep(pattern, names(DF2))], common_nms)]
    #    Name Identifier        City Value2 Value3
    #1   Jeff       ISE2     Seattle      1      6
    #2 George       GSY2     Houston      2      2
    #3   Carl       BHU1       Miami      3      7
    #4   Mike       POI0 Los Angeles      8      8
    #5  Linel       ANN1          DC      5      4
    #6   Pork       CNU4    Portland      5      1
    

    The idea is to use a pattern to filter column names and then determine common names through intersect. Final column names are then given by column names that don't match the pattern (the "unique" names) and the common column names from the intersect.