Search code examples
rdatasetdata-cleaning

Making an index/table of two dataset column names


I have a survey called 'pms_raw'. I import that to R from online, and then immediately make a copy, into 'survey' so I have pms_raw as raw data, and survey which I start cleaning.

I have cleaned up 'survey' which involved changing the names of all the columns, from things like 'Q1' to 'age'. Now I need to make some sort of index/table where I can check old Q names from pms_raw and compare to new Q names in 'survey' so others can see what I've done.

The way I was planning on doing this, was by making a table manually, which will be quite a long route. Does anyone have any suggestions of how to do this in a better way? It would probably involve some kind of matching of column names?

An added dimension is that some columns have been totally removed in the 'survey' so the table needs to have N/A at those spots.

In a dream world it would look like:

pms_raw   survey
Q1        age
Q2        height
Q3        weight

Thanks so much


Solution

  • I agree with IanCampbell completely. Until then, I'll demonstrate a "lookup" (named vector):

    mt <- mtcars[1:3,]
    newnames <- c(cyl="newcyl", mpg="newmpg", disp="newdisp")
    ifelse(names(mt) %in% names(newnames), newnames[ names(mt) ], names(mt) )
    #  [1] "newmpg"  "newcyl"  "newdisp" "hp"      "drat"    "wt"      "qsec"   
    #  [8] "vs"      "am"      "gear"    "carb"   
    names(mt) <- ifelse(names(mt) %in% names(newnames), newnames[ names(mt) ], names(mt) )
    mt
    #               newmpg newcyl newdisp  hp drat    wt  qsec vs am gear carb
    # Mazda RX4       21.0      6     160 110 3.90 2.620 16.46  0  1    4    4
    # Mazda RX4 Wag   21.0      6     160 110 3.90 2.875 17.02  0  1    4    4
    # Datsun 710      22.8      4     108  93 3.85 2.320 18.61  1  1    4    1
    

    Alternatives, all using the same premise:

    # same, but if_else can be safer in general to base ifelse
    dplyr::if_else(names(mt) %in% names(newnames), newnames[ names(mt) ], 
    # slightly different logic
    dplyr::coalesce(newnames[ names(mt) ], names(mt) )
    

    Another alternative, but first reverse the layout of the lookup.

    newnames <- c("newcyl"="cyl", "newmpg"="mpg", "newdisp"="disp")
    dplyr::rename(mt, !!newnames)