I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it in data.table under R. I know that there are things such as na.locf
in the zoo package and self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.
Here is some fictitious data to orient you (as a dput ASCII text representation)
data <- data.frame(
patient.first.name = factor(rep(c("John", "Kathy", "Timothy"), c(5L, 5L, 4L))),
patient.last.name = factor(rep(c("Squeal", "Martinez", "Jones"), c(5L, 5L, 4L))),
medical.record.nr = rep(c(4563455, 2663775, 3330956), c(5L, 5L, 4L)),
date.of.service = c(
39087, 39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360, 39184,
39194, 39198, 39216
),
procedure.code = c(
44750, 38995, 40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750,
44729, 44729, 44729
),
diagnosis.code.1 = rep(c(456.87, 521.37, 356.36, 456.87), c(5L, 4L, 1L, 4L)),
diagnosis.code.2 = rep(c(413, 532.23, NA), c(5L, 1L, 8L)),
referring.doctor.first = factor(rep(c("Abe", "Mark", NA, "Abe", NA), c(5L, 3L, 3L, 2L, 1L))),
referring.doctor.last = factor(rep(c("Newstead", "Wydell", NA, "Newstead", NA), c(5L, 3L, 3L, 2L, 1L))),
referring.docotor.zip = rep(c(15209, 15222, NA, 15209, NA), c(5L, 3L, 3L, 2L, 1L)),
some.other.stuff = factor(c(
"alkjkdkdio", "alkjkdkdio", "alkjkdkdio", NA, "ddddd", NA, NA, "dddddd", NA,
"lkjljkkkkk", NA, "cheerios", "dogs", NA
))
)
The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words the LOCF must only be applied to one patient who is identified by the combination of patient.first.name, patient.last.name, medical.record.nr. Also note how some patients are missing the referring doctor on their very first visit so that means that some observations have to be carried backwards. To complicate matters some patients change primary care physicians and so there may be one referring doctor earlier on and another one later on. The alogorithm therefore needs to be aware of the date order of the rows with missing values.
In zoo na.locf
I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen, would not work here becasuse I cannot simply take out the rows with the missing referring.doctor information since I would then loose date.of.service and procedure.code etcetera. I would love your help in learning how R can fill in my missing data.
@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf
. The problem is not amenable to rolling joins.
setDT(bill)
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE),
by=list(patient.last.name, patient.first.name, medical.record.nr)]
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE,fromLast=TRUE),
by=list(patient.last.name, patient.first.name, medical.record.nr)]
Then do something similar for referring.doctor.first
A few pointers:
The by
statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE
argument. If one does not then a patient who is missing information for a referring physician on their very first visit will have the NA
removed and the vector of new values (existing + carried forward) will be one element short of the number of rows. The shortened vector is recycled and everything gets shifted up and the last row gets the first element of the vector as it is recycled. In other words, a big mess. And worst of all you will only see it sometimes.
Use fromLast=TRUE
to run through the column again. That fills in the NA that preceded any data. Instead of last observation carried forward (LOCF) zoo uses next observation carried backward (NOCB). Happiness - you have now filled in the missing data in a way that is correct for most circumstances.
You can pass multiple :=
per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]