Loss function that ignores missing target values in keras for R

I am fitting an LSTM model to a multivariate time series using the keras R package (an answer for keras in Python or for PyTorch would also be helpful, as I could switch). The model has multiple outputs (three continuous, one categorical). Some of the targets are missing for some time steps (coded as -1, because all observed values are $\geq 0$, but I could obviously change that to anything else). What I think would make sense is that any prediction by the model is considered right (=no loss incurred), if the target variable is missing (=-1). I have no interest in predicting whether values are missing, so forcing the model to output -1 is of no use to me, even if the model could reliably predict the missingness. I'd much rather get a prediction of what the missing value would be (even if I have no way of checking whether that is correct).

How do I create a custom loss function that "ignores" -1 values / considers them correct?

In case more of the context matters, below is a diagram illustrating my model and, below that, R code to generate some example data and fit a model when there is no missing data. Once you uncomment the # %>% mutate_at(vars(x1:x4, y1:y4), randomly_set_to_minus_one) line in the code below, some inputs and outputs are coded as -1. I don't have a strong opinion on how these should be coded as features; I could also set the values to the median input value and add a flag for missingness, or do something else. What really matters to me is that my loss function deals with -1 target values correctly. At the end of the post is my failed attempt to write such a loss function.

library(tidyverse)
library(keras)

# A function I use to set some values randomly to -1
randomly_set_to_minus_one = function(x){
  ifelse(rnorm(length(x))>1, -1, x)
}
# randomly_set_to_minus_one(rnorm(100))

set.seed(1234)
subjects = 250
records_per_subject = 25

# Simulate some time series for multiple subjects with multiple records per subject.
example = tibble(subject = rep(1:subjects, each=records_per_subject),
       rand1 = rep(rnorm(subjects), each=records_per_subject),
       rand2 = rep(rnorm(subjects), each=records_per_subject),
       rand3 = rnorm(subjects*records_per_subject),
       rand4 = rnorm(subjects*records_per_subject)) %>%
  mutate(x1 = 0.8*rand1 + 0.2*rand2 + 0.8*rand3 + 0.2*rand4 + rnorm(n=n(),sd=0.1),
         x2 = 0.1*rand1 + 0.9*rand2 + 2*rand3 + rnorm(n=n(),sd=0.1),
         x3 = 0.5*rand1 + 0.5*rand2 + 0.2*rand4 + rnorm(n=n(),sd=0.25),
         x4 = 0.2*rand1 + 0.2*rand2 + 0.5*rand3 + 0.5*rand4 + rnorm(n=n(),sd=0.1),
         x5 = rep(1:records_per_subject, subjects),
         y1 = 1+tanh(rand1 + rand2 + 0.05*rand3 + 0.05*rand4 + 2*x5/records_per_subject + rnorm(n=n(),sd=0.05)),
         y2 = 10*plogis(0.2*rand1 + 0.2*rand2 + 0.2*rand3 + 0.2*rand4),
         y3 = 3*plogis(0.8*rand1 + 0.8*rand4 + 2*(x5-records_per_subject/2)/records_per_subject),
         prob1 = exp(rand1/4*3+rand3/4),
         prob2 = exp(rand2/4*3+rand4/4),
         prob3 = exp(-rand1-rand2-rand3-rand4),
         total = prob1+prob2+prob3,
         prob1 = prob1/total,
         prob2 = prob2/total,
         prob3 = prob3/total,
         y4 = pmap(list(prob1, prob2, prob3), function(x,y,z) sample(1:3, 1, replace=T, prob=c(x,y,z)))) %>%
  unnest(y4) %>%
  mutate(x1 = x1 - min(x1),  # subtract the minimum so that all inputs are >= 0
         x2 = x2 - min(x2),
         x3 = x3 - min(x3),
         x4 = x4 - min(x4)) %>%
  dplyr::select(subject, x1:x5, y1:y4) 
# %>% mutate_at(vars(x1:x4, y1:y4), randomly_set_to_minus_one)
  
# Create arrays the way keras wants them as inputs/outputs:
# 250, 25, 5 array of predictors
x_array = map(sort(unique(example$subject)), function(x) {
  example %>%
    filter(subject==x) %>%
    dplyr::select(x1:x5) %>%
    as.matrix()
}) %>%
  abind::abind(along=3 ) %>%
  aperm(perm=c(3,1,2))

# 250, 25, 3 array of continuous target variables
y13_array = map(sort(unique(example$subject)), function(x) {
  example %>%
    filter(subject==x) %>%
    dplyr::select(y1:y3) %>%
    as.matrix()
}) %>%
  abind::abind(along=3 ) %>%
  aperm(perm=c(3,1,2))

# 250, 25, 3 array of the categorical target variable (one-hot-encoded into 3 columns)
y4_array = map(sort(unique(example$subject)), function(x) {
  example %>%
    filter(subject==x) %>%
    mutate(y41 = case_when(y4==1~1, y4==-1~-1, TRUE~0),
           y42 = case_when(y4==2~1, y4==-1~-1, TRUE~0),
           y43 = case_when(y4==3~1, y4==-1~-1, TRUE~0)) %>%
    dplyr::select(y41:y43) %>%
    as.matrix()
}) %>%
  abind::abind(along=3 ) %>%
  aperm(perm=c(3,1,2))

# Define LSTM neural network
nn_inputs <- layer_input(shape = c(dim(x_array)[2], dim(x_array)[3])) 

nn_lstm_layers <- nn_inputs %>%
  layer_lstm(units = 32, return_sequences = TRUE, 
             dropout = 0.3, # That's dropout applied to the inputs, the below is recurrent drop-out applied to LSTM memory cells
             recurrent_dropout = 0.3) %>%
  layer_lstm(units = 16,
             return_sequences = TRUE, 
             dropout = 0.3, 
             recurrent_dropout = 0.3)

# First continuous output (3 variables)
cont_target <- nn_lstm_layers %>%
  layer_dense(units = dim(y13_array)[3], name = "cont_target")

# Categorical outcome (3 categories one-hot-encoded)
cat_target <- nn_lstm_layers %>%
  layer_dense(units = dim(y4_array)[3], activation = "softmax", name = "cat_target")

model <- keras_model(nn_inputs,
                     list(cont_target, cat_target))
summary(model)

val_samples = sample(x=c( rep(FALSE, floor(dim(x_array)[1]*0.8)),
                          rep(TRUE, ceiling(dim(x_array)[1]*0.2))),
                     size = dim(x_array)[1],
                     replace = F)

model %>% compile(
  optimizer = "rmsprop",
  loss = list( cont_target = "mse", 
               cat_target = "categorical_crossentropy"),
  loss_weights = list(cont_target = 1.0, cat_target = 1.0))

history <- model %>% 
  fit(
    x_array[!val_samples,,], 
    list(cont_target = y13_array[!val_samples,,], 
         cat_target = y4_array[!val_samples,,]),
    epochs = 100, 
    batch_size = 32,
    validation_data = list(x_array[val_samples,,], 
                           list(cont_target = y13_array[val_samples,,], 
                                cat_target = y4_array[val_samples,,])),
    callbacks = list(callback_reduce_lr_on_plateau(
      monitor = "val_loss", factor = 0.5, patience = 10, verbose = 0, 
      mode = "min", min_delta = 1e-04, cooldown = 0, min_lr = 0),
      callback_early_stopping(monitor = "val_loss", 
                              min_delta = 0,
                              patience = 20,
                              restore_best_weights = TRUE,
                              verbose = 0, mode = c("auto")))
  )

plot(history) + scale_y_log10()

Here's my attempt at writing a modified MSE-loss function that ignores -1 values:

# Custom loss functions to deal with missing values (coded as -1)
mse_na_loss <- function(y_true, y_pred){
  K <- backend()
  #K$mean( K$switch(K$equal(y_true, -1), K$zeros(shape=K$constant(y_true)$shape), K$pow(y_true-y_pred, 2)), axis=-1)
  #K$mean( K$pow(y_true-y_pred, 2))
  #K$zeros(shape=K$constant(y_true)$shape)
  #K$equal(y_true, -1)
  K$mean(
  K$switch( K$equal(y_true, -1),
            K$zeros(shape=K$constant(y_true)$shape, dtype = "float64"),
            K$pow(y_true-y_pred, 2)),
  axis=-1L)
}

Solution

"What I think would make sense is that any prediction by the model is considered right (=no loss incurred), if the target variable is missing (=-1)."

You can achieve this (no loss incurred) by checking whether y_true differs from -1 (k_not_equal) and then casting the resulting boolean tensor to numeric (k_cast). This gives a mask of values such as (1,0,1,1,0), which you can multiply with the squared error.

mse_na_loss <- function(y_true, y_pred){
  # Squared error per element, multiplied by a 0/1 mask that is 0 where the target is -1 (missing)
  k_pow(y_true-y_pred, 2) * k_cast(k_not_equal(y_true, -1), 'float32')
}
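
To make the masking concrete, here is a minimal sketch of the same arithmetic in plain Python rather than keras (the question says a Python answer would also be acceptable). The names MISSING, masked_squared_errors and masked_mse are made up for this illustration, and the variant that divides by the number of observed targets is an assumption on my part, not something the R function above does.

```python
MISSING = -1.0  # sentinel the question uses for a missing target


def masked_squared_errors(y_true, y_pred):
    """Squared error per element, multiplied by a 0/1 mask (0 = missing)."""
    return [
        (t - p) ** 2 * float(t != MISSING)
        for t, p in zip(y_true, y_pred)
    ]


def masked_mse(y_true, y_pred):
    """Assumed variant: average over the observed targets only."""
    n_observed = sum(t != MISSING for t in y_true)
    # max(..., 1) guards against a batch in which every target is missing.
    return sum(masked_squared_errors(y_true, y_pred)) / max(n_observed, 1)


y_true = [2.0, -1.0, 0.5, 3.0, -1.0]
y_pred = [1.0, 7.0, 0.5, 5.0, -4.0]

per_element = masked_squared_errors(y_true, y_pred)
plain_mean = sum(per_element) / len(per_element)  # divides by all 5 entries
observed_mean = masked_mse(y_true, y_pred)        # divides by the 3 observed

print(per_element)  # [1.0, 0.0, 0.0, 4.0, 0.0]
print(plain_mean)   # 1.0
```

Whatever the model predicts at the two masked positions, their contribution stays at zero. Averaging over all entries shrinks the loss as more targets go missing, whereas dividing by the number of observed entries does not; which of the two you want is a judgment call.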

This mse_na_loss is essentially the loss function you tried to write at the end of your question, and it does what the quoted part of your question asks for.
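
The question also has a one-hot-encoded categorical target in which a missing time step is coded as -1 in all three columns. The same masking idea should carry over to cross-entropy; the following plain-Python sketch only illustrates the arithmetic and is not tested keras code. The function name masked_categorical_crossentropy and the eps clipping constant are assumptions made for this example.

```python
import math

MISSING = -1.0  # sentinel the question uses for a missing target


def masked_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy per time step; 0 where the one-hot row is coded MISSING."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        if any(t == MISSING for t in true_row):
            # Missing target: no loss, whatever the model predicts.
            losses.append(0.0)
            continue
        # Clip probabilities away from 0 so that log() stays finite.
        losses.append(
            -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
        )
    return losses


# Three time steps; the second one has a missing category.
y_true = [[0.0, 1.0, 0.0], [-1.0, -1.0, -1.0], [1.0, 0.0, 0.0]]
y_pred = [[0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.5, 0.3, 0.2]]

losses = masked_categorical_crossentropy(y_true, y_pred)
# losses[0] is -log(0.7), losses[1] is 0.0, losses[2] is -log(0.5)
```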

However, I don't think this is a good way to go. This loss function doesn't "ignore" those observations as you stated; the model just learns that any value fits there, which might introduce unnecessary noise into your learning.

Depending on the domain, other NA handling methods such as 'last observation carried forward' (na.locf) might be a better replacement than -1.
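
For illustration, here is a minimal sketch of what 'last observation carried forward' does to a single series. It is written in plain Python as a stand-in and is not the na.locf function itself; the name locf and the choice to leave leading missing values untouched are assumptions of this example.

```python
MISSING = -1.0  # sentinel the question uses for a missing value


def locf(values, missing=MISSING):
    """Replace each missing value with the most recent observed one.

    Leading missing values have nothing to carry forward and are left as is.
    """
    filled = []
    last_seen = None
    for v in values:
        if v != missing:
            last_seen = v
        filled.append(v if last_seen is None else last_seen)
    return filled


# One subject's series for a single variable, in time order.
series = [-1.0, 2.0, -1.0, -1.0, 3.5, -1.0]
print(locf(series))  # [-1.0, 2.0, 2.0, 2.0, 3.5, 3.5]
```

With data like the question's, this would have to be applied separately per subject and per variable so that values are not carried over from one subject to the next.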