I have a set of UK postcodes which need to be reformatted. They are made up of an incode and an outcode, where incode is of the form 'number letter letter' e.g. 2DB and the outcode is a combination of between 2 and 4 letters and numbers e.g. NW1 or SW10 or EC1A
Currently there is one space between the incode and outcode, but I need to reformat these so that the full postcode is 7 characters long e.g: ('-' stands for space)
Data:
df <- data.frame("postcode"=c("NW1 2DB","SW10 9NH","E1 6QL"))
df
# postcode
# 1 NW1 2DB
# 2 SW10 9NH
# 3 E1 6QL
I have written a regex string to separate the outcode and incode, but couldn't find a way to add a variable number of spaces between them (this example just creates two spaces between outcode and incode).
require(dplyr)
df <- df %>% mutate(postcode_2sp = gsub('?(\\S+)\\s*?(\\d\\w{2})$','\\1 \\2', postcode)
To get around that I've tried to use mutate()
,nchar()
and rep()
:
df<-df %>%
mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
mutate(out_length=nchar(outcode))%>%
mutate(postcode7=paste0(outcode,
paste0(rep(" ",4-out_length),collapse=""),
incode))
but get this error:
Error: invalid 'times' argument
without the last step to create postcode7 the df looks as follows:
df
# postcode outcode incode out_length
# 1 NW1 2DB NW1 2DB 3
# 2 SW10 9NH SW10 9NH 4
# 3 E1 6QL E1 6QL 2
And if I set the rep 'times' argument to a constant the code runs as expected (but doesn't do what I need it to do!)
df<-df %>%
mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
mutate(out_length=nchar(outcode))%>%
mutate(postcode7=paste0(outcode,
paste0(rep(" ",4),collapse=""),
incode))
df
# postcode outcode incode out_length postcode7
# 1 NW1 2DB NW1 2DB 3 NW1 2DB
# 2 SW10 9NH SW10 9NH 4 SW10 9NH
# 3 E1 6QL E1 6QL 2 E1 6QL
Is there a way to make rep()
accept a column as the times argument in a mutate? Or should I be looking at a totally different approach?
EDIT: I've just realised that I can use an if
statement for each case of 2 characters, 3 characters or 4 characters in the outcode but that doesn't feel very elegant.
Another solution, using sprintf
to format the output, and tidyr::extract
for matching. This has the advantage of drastically simplifying both the pattern and the code for padding:
df %>%
extract(postcode, into = c('out', 'in'), '(\\S{2,4})\\s*(\\d\\w\\w)') %>%
mutate(postcode = sprintf('% -4s%s', out, `in`))
I do like the separate
version posted above, but it requires that the postcodes are all separated by whitespace. In my experience this generally isn’t the case.