Search code examples
rdataframedplyrtidyverselookup-tables

How to create a look-up table from a dataframe of changes?


I want to create a look-up table from a dataframe of changes. Each row of the original dataframe signifies a change in the coding for a given district. That dataset covers some time period, 2009 to 2019. While a district may experience several changes during that time period, I want the 2009 and 2019 codings of each district. That is, the first and latest coding.

The dataframe covers hundreds of districts. Some districts may go through just a single change, while others go through multiple. A district can be merged or split into multiple others.

The ideal look-up table would look as follows:

coding_2009 coding_2019
00QR S12000047
00QR S12000048
00RB S12000047
00RB S12000048

coding_2009 is the district's coding as of 2009, and coding_2019 is its latest coding as of 2019.

The original dataframe (a subset), where each row shows a change, looks like:

past new date
00QR S12000015 2009-01-01
S12000015 S12000047 2018-02-02
S12000015 S12000048 2018-02-02
00RB S12000015 2009-01-01
S12000024 S12000047 2018-02-02
S12000024 S12000048 2018-02-02

For each row, past is the code that is recoded into new as of date.

For example, district 00QR is turned into S12000015, which is later split into S12000047 and S12000048.

I've been dealing with this problem for weeks now, trying out different ad hoc versions but none seem to consistently work out. Please note that the code needs to consider that some districts experience just one change, whereas others may experience two or more changes. Districts can also be split or merged, as the example shows.

The ideal answer would use tidyverse.

For a reprex, I have subset a selection of districts below.

Thanks for your help! It would be appreciated immensely.

Reprex data: (you can also go above and beyond and use the original dataset, Changes.csv. See link below)

# Library tibble (a part of tidyverse) is needed to copy paste reprex data
#install.packages("tibble") # if you need to install it
library(tibble)

data <- tibble::tribble(
        ~past,        ~new,        ~date,
       "00RJ", "S12000013", "2009-01-01",
       "00QR", "S12000015", "2009-01-01",
       "00RB", "S12000024", "2009-01-01",
       "13UD", "E07000015", "2009-01-01",
       "15UH", "E07000025", "2009-01-01",
       "00HC", "E06000024", "2009-01-01",
       "00KG", "E06000034", "2009-01-01",
       "19UD", "E07000049", "2009-01-01",
       "19UE", "E07000050", "2009-01-01",
       "19UG", "E07000051", "2009-01-01",
       "19UH", "E07000052", "2009-01-01",
       "19UJ", "E07000053", "2009-01-01",
  "E07000017", "E06000049", "2009-04-01",
  "E07000025", "E06000053", "2009-04-01",
  "E07000014", "E06000049", "2009-04-01",
  "E07000015", "E06000049", "2009-04-01",
  "S12000013", "S12000013", "2015-06-16",
  "S12000013", "S12000013", "2015-11-01",
  "S12000015", "S12000047", "2018-02-02",
  "S12000024", "S12000047", "2018-02-02",
  "S12000015", "S12000048", "2018-02-02",
  "S12000024", "S12000048", "2018-02-02",
  "E07000049", "E06000059", "2019-04-01",
  "E07000050", "E06000059", "2019-04-01",
  "E07000053", "E06000059", "2019-04-01",
  "E07000051", "E06000059", "2019-04-01",
  "E07000052", "E06000059", "2019-04-01"
  )

# Convert date to Date (after being copy pasted as tibble)
data$date <- as.Date(data$date)

For anyone interested, this data is from the UK's Code History Database. You can download the zip from the link below. It's the file named Changes.csv: https://geoportal.statistics.gov.uk/datasets/code-history-database-december-2019-for-the-united-kingdom. Note, in Changes.csv, past is named geogcd_p, new is geogcd and dateis oper_date.


Solution

  • You are essentially looking at a flat packed tree structure. It is easily graphed using the igraph package:

    
    library(igraph)
    g <- dat %>% select( past,new ) %>% t %>% c %>% graph
    plot( g )
    
    

    graph

    Now from here on there are many wayt to go about it, but it comes down to a depth first or width first approach to the problem.

    It is reasonable to assume that we have several small graphs. A bunch of different codes that have gone through a few changes, rather than a select few codes that have gone through many changes.

    This suggests a width first approach, and is solvable by joining the data to itself, hopefully, not too many times:

    
    ## work with data.table for that extra speed.
    setDT(dat)
    
    ## remove duplicate entries of same code
    dat <- dat[, .(date=max(date)), by=.(past,new) ]
    
    ## these are the roots, all `past` values never present in `new`
    roots <- dat[ !past %in% new ]
    
    ## likewise, the leaves are those that never appear as `past` , unless they are self referencing.
    leaves <- unique( dat[ !new %in% past | new == past, !"past" ], by="new" )
    
    
    dd <- copy(roots)
    
    ## sucessively add next step from the source data till we have arrived at leaves only.
    while( !all( dd$new %in% leaves$new ) ) {
        dd <- unique(
            merge( dd, dat, by.x="new", by.y="past", all.x=TRUE )[ , .(date.x, past, new=coalesce(new.y,new), date.y=coalesce(date.y,date.x) ) ]
        )
    }
    
    ## final cleanup
    dd[ order(past), .(coding_2009=past,coding_2019=new) ]
    
    

    Output:

    
    > dd[ order(past), .(coding_2009=past,coding_2019=new) ]
        coding_2009 coding_2019
     1:        00HC   E06000024
     2:        00KG   E06000034
     3:        00QR   S12000047
     4:        00QR   S12000048
     5:        00RB   S12000047
     6:        00RB   S12000048
     7:        00RJ   S12000013
     8:        13UD   E06000049
     9:        15UH   E06000053
    10:        19UD   E06000059
    11:        19UE   E06000059
    12:        19UG   E06000059
    13:        19UH   E06000059
    14:        19UJ   E06000059
    15:   E07000014   E06000049
    16:   E07000017   E06000049
    
    

    Now I have only looked at the mini dataset, so I have no idea how the code will eprform in the wild, but you could give it a go.

    Looking at the image above, we see that there are at most 3 steps for each graph from root to leaf, meaning the above while loop only had to run once.