I want to create a look-up table from a dataframe of changes. Each row of the original dataframe signifies a change in the coding for a given district. The dataset covers the period 2009 to 2019. While a district may experience several changes during that period, I want the 2009 and 2019 codings of each district, that is, the first and the latest coding.
The dataframe covers hundreds of districts. Some districts may go through just a single change, while others go through multiple. A district can be merged or split into multiple others.
The ideal look-up table would look as follows:
| coding_2009 | coding_2019 |
|---|---|
| 00QR | S12000047 |
| 00QR | S12000048 |
| 00RB | S12000047 |
| 00RB | S12000048 |
`coding_2009` is the district's coding as of 2009, and `coding_2019` is its latest coding as of 2019.
The original dataframe (a subset), where each row shows a change, looks like:
| past | new | date |
|---|---|---|
| 00QR | S12000015 | 2009-01-01 |
| S12000015 | S12000047 | 2018-02-02 |
| S12000015 | S12000048 | 2018-02-02 |
| 00RB | S12000024 | 2009-01-01 |
| S12000024 | S12000047 | 2018-02-02 |
| S12000024 | S12000048 | 2018-02-02 |
For each row, `past` is the code that is recoded into `new` as of `date`.
For example, district `00QR` is turned into `S12000015`, which is later split into `S12000047` and `S12000048`.
I've been dealing with this problem for weeks now, trying out different ad hoc versions, but none of them works consistently. Please note that the code needs to handle districts that experience just one change as well as those that experience two or more. Districts can also be split or merged, as the example shows.
The ideal answer would use the `tidyverse`.
For a reprex, I have subset a selection of districts below.
Thanks for your help! It would be appreciated immensely.
Reprex data:
(You can also go above and beyond and use the original dataset, `Changes.csv`; see the link below.)
# The tibble package (part of the tidyverse) is needed to copy-paste the reprex data
#install.packages("tibble") # if you need to install it
library(tibble)
data <- tibble::tribble(
~past, ~new, ~date,
"00RJ", "S12000013", "2009-01-01",
"00QR", "S12000015", "2009-01-01",
"00RB", "S12000024", "2009-01-01",
"13UD", "E07000015", "2009-01-01",
"15UH", "E07000025", "2009-01-01",
"00HC", "E06000024", "2009-01-01",
"00KG", "E06000034", "2009-01-01",
"19UD", "E07000049", "2009-01-01",
"19UE", "E07000050", "2009-01-01",
"19UG", "E07000051", "2009-01-01",
"19UH", "E07000052", "2009-01-01",
"19UJ", "E07000053", "2009-01-01",
"E07000017", "E06000049", "2009-04-01",
"E07000025", "E06000053", "2009-04-01",
"E07000014", "E06000049", "2009-04-01",
"E07000015", "E06000049", "2009-04-01",
"S12000013", "S12000013", "2015-06-16",
"S12000013", "S12000013", "2015-11-01",
"S12000015", "S12000047", "2018-02-02",
"S12000024", "S12000047", "2018-02-02",
"S12000015", "S12000048", "2018-02-02",
"S12000024", "S12000048", "2018-02-02",
"E07000049", "E06000059", "2019-04-01",
"E07000050", "E06000059", "2019-04-01",
"E07000053", "E06000059", "2019-04-01",
"E07000051", "E06000059", "2019-04-01",
"E07000052", "E06000059", "2019-04-01"
)
# Convert date to Date (after being copy pasted as tibble)
data$date <- as.Date(data$date)
For anyone interested, this data is from the UK's Code History Database. You can download the zip from the link below; the relevant file is named `Changes.csv`: https://geoportal.statistics.gov.uk/datasets/code-history-database-december-2019-for-the-united-kingdom. Note that in `Changes.csv`, `past` is named `geogcd_p`, `new` is `geogcd`, and `date` is `oper_date`.
You are essentially looking at a flat-packed tree structure. It is easily graphed using the igraph package:
library(igraph)
library(dplyr)

dat <- data  # the reprex tibble from the question, renamed to match the code below

g <- dat %>% select(past, new) %>% t %>% c %>% graph
plot(g)
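As an aside, once the graph is built you can also read the look-up table straight off it with plain reachability. This is only a sketch against the reprex: `simplify()` first drops the self-referencing edges (such as `S12000013` mapping to itself), so that in- and out-degrees identify the roots and leaves.
## drop self-loops and duplicate edges so degrees identify roots and leaves
g2 <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
root_names <- names(V(g2)[degree(g2, mode = "in") == 0])   # the 2009 codes
leaf_names <- names(V(g2)[degree(g2, mode = "out") == 0])  # the 2019 codes
## for each root, collect every leaf reachable from it
lookup <- do.call(rbind, lapply(root_names, function(r) {
  reachable <- names(subcomponent(g2, r, mode = "out"))
  data.frame(coding_2009 = r, coding_2019 = intersect(reachable, leaf_names))
}))
`subcomponent()` with `mode = "out"` returns everything reachable from a vertex, so intersecting with the leaves gives exactly the 2019 codes each 2009 code ends up in.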
Now from here on there are many ways to go about it, but it comes down to a depth-first or breadth-first approach to the problem.
It is reasonable to assume that we have several small graphs: a bunch of different codes that have each gone through a few changes, rather than a select few codes that have gone through many changes.
This suggests a breadth-first approach, and the problem is solvable by joining the data to itself, hopefully not too many times:
## work with data.table for that extra speed
library(data.table)
setDT(dat)
## remove duplicate entries of the same code change, keeping the latest date
dat <- dat[, .(date = max(date)), by = .(past, new)]
## these are the roots: all `past` values never present in `new`
roots <- dat[!past %in% new]
## likewise, the leaves are those that never appear as `past`, unless they are self-referencing
leaves <- unique(dat[!new %in% past | new == past, !"past"], by = "new")
dd <- copy(roots)
## successively add the next step from the source data until we have arrived at leaves only
while (!all(dd$new %in% leaves$new)) {
  dd <- unique(
    merge(dd, dat, by.x = "new", by.y = "past", all.x = TRUE)[
      , .(date.x, past,
          new = fcoalesce(new.y, new),          # keep the current code where no further change exists
          date.y = fcoalesce(date.y, date.x))]  # and carry the last change date forward with it
  )
}
## final cleanup
dd[ order(past), .(coding_2009=past,coding_2019=new) ]
Output:
> dd[ order(past), .(coding_2009=past,coding_2019=new) ]
coding_2009 coding_2019
1: 00HC E06000024
2: 00KG E06000034
3: 00QR S12000047
4: 00QR S12000048
5: 00RB S12000047
6: 00RB S12000048
7: 00RJ S12000013
8: 13UD E06000049
9: 15UH E06000053
10: 19UD E06000059
11: 19UE E06000059
12: 19UG E06000059
13: 19UH E06000059
14: 19UJ E06000059
15: E07000014 E06000049
16: E07000017 E06000049
Now I have only looked at the mini dataset, so I have no idea how the code will perform in the wild, but you could give it a go.
Looking at the image above, we see that each chain runs through at most three nodes (two changes) from root to leaf, which is why the while loop above only had to run once.
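Finally, since the question asked for the tidyverse: the same breadth-first idea translates almost one-to-one into dplyr verbs. This is a minimal sketch, only checked against the reprex above; the `is_leaf()` helper is my own, not a package function, and it assumes self-referencing rows only occur at terminal codes, as they do here.
library(dplyr)
## deduplicate repeated recodings of the same pair, keeping the latest date
changes <- data %>%
  group_by(past, new) %>%
  summarise(date = max(date), .groups = "drop")
## a code is a leaf if it never appears as `past` with a genuine change
is_leaf <- function(x) !x %in% changes$past[changes$past != changes$new]
## start from the roots: codes that never appear as a `new` value
lookup <- changes %>%
  filter(!past %in% new) %>%
  select(coding_2009 = past, current = new)
## walk forward one change at a time until every chain ends at a leaf
while (!all(is_leaf(lookup$current))) {
  lookup <- lookup %>%
    left_join(changes, by = c("current" = "past")) %>%
    mutate(current = coalesce(new, current)) %>%
    select(coding_2009, current) %>%
    distinct()
}
lookup %>%
  rename(coding_2019 = current) %>%
  arrange(coding_2009)
On the reprex this should give the same sixteen rows as the data.table version above.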