The R package growthcurver is great for efficient analysis and visualization of organism growth, except when there are missing values. Because my data are in wide format (each column is a variable) and the sampling times differ for each variable, there are a ton of NAs. Unfortunately, the growthcurver package does not like NAs, so now I'm stuck with 2 options.
Option A
Impute the missing values. I tried mice and Hmisc for regression imputation, but that failed because there are more variables (columns) than observations in each column, and caret for random forest imputation, which did not produce any meaningful imputed values. Imputation would also leave my dataframe consisting mostly of imputed values, which I can't justify.
Option B
Modify the growthcurver function to handle NAs better than it currently does. I tried poking around in the function but couldn't find a spot where a simple na.omit() could be plopped in.
Here's the code that worked with the single-use function SummarizeGrowth() (when I manually removed NAs). I should note that this function is useful when one only has a few observations to analyze/visualize, but ideally I would use the function SummarizeGrowthByPlate(), a package-provided apply()-style function that loops through each column (variable), automatically producing visualizations and results.
Example Dataframe
time a b c d e f g
1 0.00002 NA NA NA NA NA NA NA
2 0.00003 NA NA NA NA NA NA 0.0000
3 22.00000 NA NA NA NA NA NA NA
4 24.01000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 NA
5 24.03000 NA NA NA NA NA NA 0.0666
6 28.00000 NA NA NA NA NA NA NA
7 36.00000 0.2220 0.2775 0.2775 0.1776 0.1221 0.1221 NA
8 39.00000 NA NA NA NA NA NA 0.2442
9 40.00000 NA NA NA NA NA NA NA
10 44.00000 0.3330 0.3885 0.3552 0.3108 0.2664 0.1998 NA
11 46.00000 NA NA NA NA NA NA NA
12 64.00000 NA NA NA NA NA NA 0.7881
13 67.00000 0.9435 1.2210 1.1655 0.9990 1.5984 0.5217 NA
14 88.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096 NA
15 108.00000 NA NA NA NA NA NA 1.6983
Reproducible Data
df <- structure(list(time = c(2e-05, 3e-05, 22, 24.01, 24.03, 28, 36,
39, 40, 44, 46, 64, 67, 88, 108), a = c(NA, NA, NA, 0.1443, NA,
NA, 0.222, NA, NA, 0.333, NA, NA, 0.9435, 1.8093, NA), b = c(NA,
NA, NA, 0.1554, NA, NA, 0.2775, NA, NA, 0.3885, NA, NA, 1.221,
1.8093, NA), c = c(NA, NA, NA, 0.0999, NA, NA, 0.2775, NA, NA,
0.3552, NA, NA, 1.1655, 1.8093, NA), d = c(NA, NA, NA, 0.111,
NA, NA, 0.1776, NA, NA, 0.3108, NA, NA, 0.999, 1.887, NA), e = c(NA,
NA, NA, 0.0999, NA, NA, 0.1221, NA, NA, 0.2664, NA, NA, 1.5984,
1.6872, NA), f = c(NA, NA, NA, 0.0666, NA, NA, 0.1221, NA, NA,
0.1998, NA, NA, 0.5217, 1.5096, NA), g = c(NA, 0, NA, NA, 0.0666,
NA, NA, 0.2442, NA, NA, NA, 0.7881, NA, NA, 1.6983)), class = "data.frame", row.names = c(NA,
-15L))
Success, but required manual removal of NAs from a single column with SummarizeGrowth()
library(growthcurver)
SummarizeGrowth(df$time[!is.na(df$a)], df$a[!is.na(df$a)])
Fit data to K / (1 + ((K - N0) / N0) * exp(-r * t)):
K N0 r
val: 2.121 0.004 0.085
Residual standard error: 0.02857429 on 2 degrees of freedom
Other useful metrics:
DT 1 / DT auc_l auc_e
8.13 1.2e-01 38.16 38.77
Failure when not manually removing NAs with SummarizeGrowth()
SummarizeGrowth(df$time, df$a)
Fit data to K / (1 + ((K - N0) / N0) * exp(-r * t)):
K N0 r
val: 0 0 0
Residual standard error: 0 on 0 degrees of freedom
Other useful metrics:
DT 1 / DT auc_l auc_e
0 0 0 0
Note: cannot fit data
Failure when trying to use automated SummarizeGrowthByPlate()
SummarizeGrowthByPlate(df)
sample k n0 r t_mid t_gen auc_l auc_e sigma note
1 a 0 0 0 0 0 0 0 0 cannot fit data
2 b 0 0 0 0 0 0 0 0 cannot fit data
3 c 0 0 0 0 0 0 0 0 cannot fit data
4 d 0 0 0 0 0 0 0 0 cannot fit data
5 e 0 0 0 0 0 0 0 0 cannot fit data
6 f 0 0 0 0 0 0 0 0 cannot fit data
7 g 0 0 0 0 0 0 0 0 cannot fit data
The problem with mice and Hmisc is that they are not doing time series imputation; they only look at the inter-variable correlations. That means when a row is completely NA, they can't compute anything for that row (logically, there must be at least one observed regressor in a row to perform a regression).
Since there seems to be a clear correlation in time for each of your variables, you could look at time series imputation / interpolation.
There is the imputeTS package, which offers a lot of time series imputation algorithms. But it would be hard to use here, since it requires an equally spaced time series (meaning the same time difference between each row) as input. To use this package you would first have to convert the time series to be equally spaced, which does not seem like a good idea for this specific case.
As far as I know, the package zoo can perform time series imputation on irregularly spaced time series, so this package might be the best choice for you. I would specifically try na.approx(), its linear interpolation function.
Unfortunately I can't quickly give a working example. The usage is basically:
library(zoo)
na.approx(zooobject)
The only thing you have to figure out now is how to convert your df to a zoo series (which is required as input).
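To sketch what that conversion looks like, here is a hedged example on a made-up toy frame (not your data): zoo() takes the value columns plus an order.by time index, and na.approx() then interpolates along that index.

```r
library(zoo)

# Toy irregular series with gaps (made up for illustration)
toy <- data.frame(
  time = c(0, 10, 20, 50, 60),
  a    = c(NA, 0.1, NA, 0.4, NA)
)

# A zoo series is the value columns ordered by the (irregular) time index
z <- zoo(toy[-1], order.by = toy$time)

# na.approx() interpolates linearly along the actual time index;
# na.rm = FALSE keeps leading/trailing NAs instead of dropping those rows
z_filled <- na.approx(z, na.rm = FALSE)
```

Note how the imputed value at t = 20 comes out as 0.175 rather than the naive midpoint 0.25, because zoo respects the uneven gap between t = 10 and t = 50.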
Just as a showcase that it might be worth the effort, here is a working example with imputeTS (where you do not need a zoo object beforehand):
library(imputeTS)
na_interpolation(df)
time a b c d e f g
1 0.00002 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.000000
2 0.00003 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.000000
3 22.00000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.022200
4 24.01000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.044400
5 24.03000 0.1702 0.1961 0.1591 0.1332 0.1073 0.0851 0.066600
6 28.00000 0.1961 0.2368 0.2183 0.1554 0.1147 0.1036 0.125800
7 36.00000 0.2220 0.2775 0.2775 0.1776 0.1221 0.1221 0.185000
8 39.00000 0.2590 0.3145 0.3034 0.2220 0.1702 0.1480 0.244200
9 40.00000 0.2960 0.3515 0.3293 0.2664 0.2183 0.1739 0.380175
10 44.00000 0.3330 0.3885 0.3552 0.3108 0.2664 0.1998 0.516150
11 46.00000 0.5365 0.6660 0.6253 0.5402 0.7104 0.3071 0.652125
12 64.00000 0.7400 0.9435 0.8954 0.7696 1.1544 0.4144 0.788100
13 67.00000 0.9435 1.2210 1.1655 0.9990 1.5984 0.5217 1.091500
14 88.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096 1.394900
15 108.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096 1.698300
These results are probably already far more reasonable than those from imputation packages for cross-sectional data. But remember, imputeTS assumes a regularly spaced time series; if you can get zoo working, you can get even better results, because it also takes the irregular spacing into account.
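Once the gaps are filled, the imputed frame can go straight into growthcurver. A minimal end-to-end sketch, assuming both packages are installed and using a small made-up plate rather than your data:

```r
library(imputeTS)
library(growthcurver)

# Toy plate: a "time" column plus one well column with gaps
# (made up for illustration)
plate <- data.frame(
  time = c(0, 4, 8, 12, 16, 20, 24),
  a    = c(0.01, NA, 0.10, 0.35, NA, 0.90, 1.00)
)

# Fill the NAs by linear interpolation, then fit every well at once
plate_imputed <- na_interpolation(plate)
res <- SummarizeGrowthByPlate(plate_imputed)
res
```

With the NAs gone, SummarizeGrowthByPlate() returns one row of fit parameters per well instead of the "cannot fit data" rows shown above.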