Am I missing some slicker way? Re-factoring a tibble using a tibble from NHSDataDictionaRy


I needed to rework a data import, and I saw that the NHS Data Dictionary package had been released for R. I thought I could use it to map the 1s and 2s in my gender column to more usefully named factor levels (i.e. Male and Female). I can never remember which is which, so they definitely need sorting out.

My example dataset:

gender age
1 40
2 43
2 56
1 72
9 34

Getting the data from NHS Data Dictionary with the new package is easy enough - if a little heavy on the typing...

It would be easy enough to do a join to my data, but that 'feels' the wrong way in this case.

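require(dplyr)
require(forcats)
require(purrr)   # for set_names(), which dplyr and forcats do not export
require(NHSDataDictionaRy)

myData <- tibble ( gender = as.factor(c("1","2","2","1", "9")), 
                   age = c(40,43,56,72,34)
                   )

# Lookup of data elements (link_name, full_url, xpath_nat_code, ...)
nhs_data_lookup <- nhs_data_elements()

# Scrape the national codes table for the gender element
tableR(nhs_data_lookup$full_url[nhs_data_lookup$link_name == "PERSON STATED GENDER CODE"], 
       nhs_data_lookup$xpath_nat_code[nhs_data_lookup$link_name == "PERSON STATED GENDER CODE"] ) %>%
    select(Code, Description) -> genderLookup

# Flatten to a named vector: names are the new levels, values are the old codes
set_names( genderLookup$Code, genderLookup$Description) -> genderLookup

# Splice the named vector into fct_recode() as new = old pairs
myData %>%
    mutate(gender = fct_recode(gender, !!!genderLookup))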

Gives:

gender age
Male 40
Female 43
Female 56
Male 72
Indeterminate (unable to be classified as either male or female) 34

The above works. But it's ugly code!

1. The Lookup could be a lot cleaner. Perhaps it just needs me to write a function?

NHStableR <- function(link_name) {
    nhs_data_lookup <- nhs_data_elements()
    tableR(nhs_data_lookup$full_url[nhs_data_lookup$link_name == link_name], 
           nhs_data_lookup$xpath_nat_code[nhs_data_lookup$link_name == link_name] ) %>%
        select(Code, Description) 
}

NHStableR("PERSON STATED GENDER CODE")

I'm wondering whether this should be proposed as an update to the package.

But then...

2. Saving the lookup to an object, 'flattening' that object into a named vector and then recoding the factor with it feels very 'non-tidyverse'.

So would a named-elements version of the function be better? Or have I done what I seem to be very good at, and created 3 steps when I could have typed doSomeMagicThing(Gender) and had it sorted?

NHSnameR <- function(link_name) {
    nhs_data_lookup <- nhs_data_elements()
    tableR(nhs_data_lookup$full_url[nhs_data_lookup$link_name == link_name], 
           nhs_data_lookup$xpath_nat_code[nhs_data_lookup$link_name == link_name] ) %>%
        select(Code, Description) -> x
    set_names( x$Code, x$Description)
}

myData %>%
    mutate(gender = fct_recode(gender, !!!NHSnameR("PERSON STATED GENDER CODE")))

It could be that I don't know the data dictionary well enough to know if this would work?
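On the 'flattening' step, one possibly tidier option is tibble's deframe(), which turns a two-column tibble into a named vector inside the pipe. This is only a sketch: the lookup below is a hard-coded, hypothetical stand-in for what the package returns (descriptions copied from the output above) so that it runs without the HTTP call, and genderMap is just an illustrative name.

```r
library(dplyr)
library(forcats)
library(tibble)

# Hypothetical stand-in for the scraped lookup table, hard-coded so the
# sketch runs offline. Only the three codes shown in the output above.
genderLookup <- tibble(
  Code = c("1", "2", "9"),
  Description = c(
    "Male",
    "Female",
    "Indeterminate (unable to be classified as either male or female)"
  )
)

myData <- tibble(
  gender = as.factor(c("1", "2", "2", "1", "9")),
  age = c(40, 43, 56, 72, 34)
)

# deframe() takes names from the first column and values from the second,
# so Description goes first to give the new = old pairs fct_recode() expects.
genderMap <- genderLookup %>%
  select(Description, Code) %>%
  deframe()

recoded <- myData %>%
  mutate(gender = fct_recode(gender, !!!genderMap))
```

The recode itself is unchanged; deframe() only replaces the separate set_names() assignment, so the lookup never has to be saved and then overwritten.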

So:

1. Have I missed something that would be tidier (pun intended!)?

2. If not, has anyone got thoughts on this as a function? I realize this is a bit niche!

3. Perhaps the easiest: if the dictionary has options that are not in the data (say there was no unknown gender in my data), it generates a warning. Can that be suppressed?

4. Is doing an HTTP lookup of data in the code like that bad practice?
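On question 3, one way to avoid the warning rather than suppress it is to trim the named vector to the codes that actually occur as levels before splicing it in. A minimal sketch, assuming a hand-written lookup (hypothetical, shortened labels) and data that contains no code 9; usedMap is an illustrative name.

```r
library(dplyr)
library(forcats)

# Hypothetical lookup: code "9" is in the dictionary but not in the data.
genderMap <- c(Male = "1", Female = "2", Indeterminate = "9")

myData <- tibble(
  gender = factor(c("1", "2", "2", "1")),
  age = c(40, 43, 56, 72)
)

# Keep only the entries whose code is an existing level of the factor,
# so fct_recode() is never asked to recode a level it cannot find.
usedMap <- genderMap[genderMap %in% levels(myData$gender)]

recoded <- myData %>%
  mutate(gender = fct_recode(gender, !!!usedMap))
```

Wrapping the call in suppressWarnings() would also silence it, but that hides any other warning raised in the same call, so filtering the lookup is arguably the safer choice.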


Solution

  • New functions have been added to the development release that now offer these:

    NHStableR( "PERSON STATED GENDER CODE" )
    
    tableRtoNamedList ( "PERSON STATED GENDER CODE", table_method = "NHStableR" )