
Handling missing values in growthcurver


The R package growthcurver is great for efficient analysis and visualization of organism growth, except when there are missing values. Because my data are in wide format (each column is a variable) and each variable was measured at different times, there are a ton of NAs. Unfortunately, the growthcurver package does not like NAs, so now I'm stuck with the options below.

  • Option A

    • Impute the missing data via a logistic regression or machine learning approach. I don't like this option: I tried mice and Hmisc for regression imputation, which failed because there are more variables (columns) than observations in each column, and caret for random forest imputation, which did not produce any meaningful imputed values. Imputation would also leave my dataframe consisting mostly of imputed values, which I can't justify.
  • Option B

    • Somehow adapt the growthcurver function to handle NAs better than it currently does. I tried poking around with the function but couldn't find a spot where a simple na.omit() could be plopped in.

  • Option C
    • Hope the SO community has a quick-fix!

Below is the code that worked with the single-column function SummarizeGrowth() once I manually removed the NAs. That function is fine when there are only a few columns to analyze or visualize, but ideally I would use SummarizeGrowthByPlate(), the package's wrapper that loops through each column (variable) and automatically produces results and visualizations.

Example Dataframe

        time      a      b      c      d      e      f      g
1    0.00002     NA     NA     NA     NA     NA     NA     NA
2    0.00003     NA     NA     NA     NA     NA     NA 0.0000
3   22.00000     NA     NA     NA     NA     NA     NA     NA
4   24.01000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666     NA
5   24.03000     NA     NA     NA     NA     NA     NA 0.0666
6   28.00000     NA     NA     NA     NA     NA     NA     NA
7   36.00000 0.2220 0.2775 0.2775 0.1776 0.1221 0.1221     NA
8   39.00000     NA     NA     NA     NA     NA     NA 0.2442
9   40.00000     NA     NA     NA     NA     NA     NA     NA
10  44.00000 0.3330 0.3885 0.3552 0.3108 0.2664 0.1998     NA
11  46.00000     NA     NA     NA     NA     NA     NA     NA
12  64.00000     NA     NA     NA     NA     NA     NA 0.7881
13  67.00000 0.9435 1.2210 1.1655 0.9990 1.5984 0.5217     NA
14  88.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096     NA
15 108.00000     NA     NA     NA     NA     NA     NA 1.6983

Reproducible Data

df <- structure(list(time = c(2e-05, 3e-05, 22, 24.01, 24.03, 28, 36, 
39, 40, 44, 46, 64, 67, 88, 108), a = c(NA, NA, NA, 0.1443, NA, 
NA, 0.222, NA, NA, 0.333, NA, NA, 0.9435, 1.8093, NA), b = c(NA, 
NA, NA, 0.1554, NA, NA, 0.2775, NA, NA, 0.3885, NA, NA, 1.221, 
1.8093, NA), c = c(NA, NA, NA, 0.0999, NA, NA, 0.2775, NA, NA, 
0.3552, NA, NA, 1.1655, 1.8093, NA), d = c(NA, NA, NA, 0.111, 
NA, NA, 0.1776, NA, NA, 0.3108, NA, NA, 0.999, 1.887, NA), e = c(NA, 
NA, NA, 0.0999, NA, NA, 0.1221, NA, NA, 0.2664, NA, NA, 1.5984, 
1.6872, NA), f = c(NA, NA, NA, 0.0666, NA, NA, 0.1221, NA, NA, 
0.1998, NA, NA, 0.5217, 1.5096, NA), g = c(NA, 0, NA, NA, 0.0666, 
NA, NA, 0.2442, NA, NA, NA, 0.7881, NA, NA, 1.6983)), class = "data.frame", row.names = c(NA, 
-15L))

Success, but required manual removal of NAs from a single column with SummarizeGrowth()

library(growthcurver)

SummarizeGrowth(df$time[!is.na(df$a)], df$a[!is.na(df$a)])

Fit data to K / (1 + ((K - N0) / N0) * exp(-r * t)): 
    K   N0  r
  val:  2.121   0.004   0.085
  Residual standard error: 0.02857429 on 2 degrees of freedom

Other useful metrics:
  DT    1 / DT  auc_l   auc_e
  8.13  1.2e-01 38.16   38.77

Failure when not manually removing NAs with SummarizeGrowth()

SummarizeGrowth(df$time, df$a)

Fit data to K / (1 + ((K - N0) / N0) * exp(-r * t)): 
    K   N0  r
  val:  0   0   0
  Residual standard error: 0 on 0 degrees of freedom

Other useful metrics:
  DT    1 / DT  auc_l   auc_e
  0 0   0   0

Note: cannot fit data

Failure when trying to use automated SummarizeGrowthByPlate()

SummarizeGrowthByPlate(df)

sample k n0 r t_mid t_gen auc_l auc_e sigma            note
1      a 0  0 0     0     0     0     0     0 cannot fit data
2      b 0  0 0     0     0     0     0     0 cannot fit data
3      c 0  0 0     0     0     0     0     0 cannot fit data
4      d 0  0 0     0     0     0     0     0 cannot fit data
5      e 0  0 0     0     0     0     0     0 cannot fit data
6      f 0  0 0     0     0     0     0     0 cannot fit data
7      g 0  0 0     0     0     0     0     0 cannot fit data
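One possible way to approach Option B without modifying the package is to repeat the manual NA removal from the single-column call for every column. The following is only a minimal sketch under stated assumptions: `fit_without_na()` is a hypothetical helper (not part of growthcurver), the data frame is a two-column subset of the example data above, and the fitted values are read from the `vals` element of the object that SummarizeGrowth() returns.

```r
library(growthcurver)

# Two-column subset of the example data (rows where both samples are NA dropped).
df <- data.frame(
  time = c(0.00003, 24.01, 24.03, 36, 39, 44, 64, 67, 88, 108),
  a    = c(NA, 0.1443, NA, 0.2220, NA, 0.3330, NA, 0.9435, 1.8093, NA),
  g    = c(0, NA, 0.0666, NA, 0.2442, NA, 0.7881, NA, NA, 1.6983)
)

# Hypothetical helper: drop the NA pairs for one sample, then fit as usual.
fit_without_na <- function(time, od) {
  keep <- !is.na(time) & !is.na(od)
  SummarizeGrowth(time[keep], od[keep])
}

# Fit every column except the time column.
samples <- setdiff(names(df), "time")
fits <- lapply(samples, function(s) fit_without_na(df$time, df[[s]]))
names(fits) <- samples

# Collect a few fitted values into one table, one row per sample.
results <- data.frame(
  sample = samples,
  k      = sapply(fits, function(f) f$vals$k),
  r      = sapply(fits, function(f) f$vals$r),
  t_gen  = sapply(fits, function(f) f$vals$t_gen),
  note   = sapply(fits, function(f) f$vals$note),
  row.names = NULL
)
print(results)
```

Each sample is fitted only on its own observed time points, so nothing is imputed. The individual fit objects stay available in `fits` (for example `fits$a`) for anyone who wants to inspect or plot them one at a time.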

Solution

  • The problem with mice and Hmisc is that they do not perform time series imputation. They only look at the correlations between variables, which means that when a row is completely NA they cannot compute anything for that row (logically, a row needs at least one observed regressor for a regression to be possible).

    Since there seems to be a clear correlation in time for each of your variables you could look at time series imputation / interpolation.

    There is the imputeTS package, which offers many time series imputation algorithms. It would be hard to use properly here, though, since it expects an equally spaced time series as input (meaning the same time difference between each row). To use the package you would first have to convert the time series to be equally spaced, which does not seem like a good idea in this specific case.

    As far as I know, the package zoo can perform time series imputation on irregularly spaced time series, so it might be the best choice for you. I would specifically try na.approx(), its linear interpolation function.

    Unfortunately I can't quickly give a working example. The usage is basically:

    library(zoo)
    na.approx(zooobject)
    

    The only thing you have to figure out now is how to convert your df to a zoo series (which is required as input).
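A rough sketch of what that conversion could look like, assuming the `time` column is used as the zoo index and the remaining columns are passed as a numeric matrix. The data frame is a two-column subset of the question's data, and `na.rm = FALSE` is a choice made here so that leading and trailing NAs are kept rather than dropped:

```r
library(zoo)

# Two-column subset of the question's data.
df <- data.frame(
  time = c(0.00003, 24.01, 24.03, 36, 39, 44, 64, 67, 88, 108),
  a    = c(NA, 0.1443, NA, 0.2220, NA, 0.3330, NA, 0.9435, 1.8093, NA),
  g    = c(0, NA, 0.0666, NA, 0.2442, NA, 0.7881, NA, NA, 1.6983)
)

# Build a zoo object indexed by the (irregular) measurement times.
z <- zoo(as.matrix(df[, c("a", "g")]), order.by = df$time)

# Linear interpolation along the time index, column by column.
# na.rm = FALSE keeps NAs that lie before the first or after the last observation.
z_filled <- na.approx(z, na.rm = FALSE)

# Back to a plain data frame with an explicit time column.
filled <- data.frame(time = index(z_filled), coredata(z_filled))
print(filled)
```

Because the interpolation runs along the zoo index, the value of `a` at time 39 lies 3/8 of the way between the observations at 36 and 44. Values outside the observed range of a column remain NA and would still need to be removed or handled separately before fitting.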

    Just as a showcase that it might be worth the effort - here is a working example with imputeTS (where you do not need a zoo object before)

    library(imputeTS)
    na_interpolation(df)
    
    
            time      a      b      c      d      e      f        g
    1    0.00002 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.000000
    2    0.00003 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.000000
    3   22.00000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.022200
    4   24.01000 0.1443 0.1554 0.0999 0.1110 0.0999 0.0666 0.044400
    5   24.03000 0.1702 0.1961 0.1591 0.1332 0.1073 0.0851 0.066600
    6   28.00000 0.1961 0.2368 0.2183 0.1554 0.1147 0.1036 0.125800
    7   36.00000 0.2220 0.2775 0.2775 0.1776 0.1221 0.1221 0.185000
    8   39.00000 0.2590 0.3145 0.3034 0.2220 0.1702 0.1480 0.244200
    9   40.00000 0.2960 0.3515 0.3293 0.2664 0.2183 0.1739 0.380175
    10  44.00000 0.3330 0.3885 0.3552 0.3108 0.2664 0.1998 0.516150
    11  46.00000 0.5365 0.6660 0.6253 0.5402 0.7104 0.3071 0.652125
    12  64.00000 0.7400 0.9435 0.8954 0.7696 1.1544 0.4144 0.788100
    13  67.00000 0.9435 1.2210 1.1655 0.9990 1.5984 0.5217 1.091500
    14  88.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096 1.394900
    15 108.00000 1.8093 1.8093 1.8093 1.8870 1.6872 1.5096 1.698300
    

    These results are probably already far more reasonable than those from imputation packages for cross-sectional data. But remember that imputeTS assumes a regularly spaced time series; if you can get zoo working, you can get even better results, because it also takes the irregular spacing into account.
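To make the effect of the spacing assumption concrete, here is a small base R illustration on rows 7 to 10 of column a (times 36, 39, 40, 44). It uses `approx()` from the stats package rather than imputeTS or zoo, so it only mimics the two behaviours: interpolating over the row position versus interpolating over the actual time values.

```r
# Rows 7 to 10 of column a from the example data.
time <- c(36, 39, 40, 44)
a    <- c(0.2220, NA, NA, 0.3330)

ok <- !is.na(a)

# Treat rows as equally spaced: interpolate over the row position.
by_index <- approx(which(ok), a[ok], xout = seq_along(a))$y

# Respect the irregular spacing: interpolate over the time values.
by_time <- approx(time[ok], a[ok], xout = time)$y

print(data.frame(time, by_index, by_time))
# by_index: 0.2220 0.2590 0.2960 0.3330
# by_time:  0.2220 0.263625 0.2775 0.3330
```

The position-based values match rows 8 and 9 of the imputeTS output above (0.2590 and 0.2960), while the time-based values differ because 39 and 40 are not evenly placed between 36 and 44.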