Tags: tensorflow, loss-function

Poor loss convergence and accuracy with CNN model


I've built a binary classifier using TF which classifies a 16x16 grayscale image into one of two classes with an 87-13 class distribution. The issue I'm having is that the model's log loss converges to ~0.4, which is better than random, but I cannot get it to improve beyond that.
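
For context, it may help to compare that ~0.4 against the two constant-predictor baselines, depending on whether the loss is measured on the balanced training data or on the 87-13 validation data. A quick sketch in plain NumPy, only for illustration:

import numpy as np

# Constant prediction p = 0.5 on the balanced (50-50) training data
balanced_baseline = -np.log(0.5)                          # ~0.693

# Constant prediction p = 0.13 on the 87-13 validation data
p, q = 0.13, 0.87
imbalanced_baseline = -(p * np.log(p) + q * np.log(q))    # ~0.386

print(balanced_baseline, imbalanced_baseline)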

The vision problem is in the realm of video encoding. This image should give some idea of the problem: blocks are either to be split or not to be split (0/1) based on their homogeneity. Note that squares near the edges are more likely to be sub-split into smaller ones.

When validating the model (1.1e7 examples, 87-13 distribution), I cannot achieve an F1-score better than ~50%.
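
For reference, the F1-score here is the harmonic mean of precision and recall on the positive (13%) class; a minimal sketch of how it can be computed with scikit-learn (the arrays below are placeholders, not the actual validation data):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Placeholder arrays standing in for the real validation labels and predictions
y_true = np.random.binomial(1, 0.13, size=10000)   # ~87-13 label distribution
y_pred = np.random.binomial(1, 0.50, size=10000)   # e.g. thresholded model output

precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))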

My training data consists of 2.2e8 examples, which are oversampled/undersampled to achieve a 50-50 distribution. I'm using a batch size of 1024 and a substantial shuffle buffer (the data isn't ordered to begin with). Optimisation is done with Adam using default hyperparameters.
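
For completeness, a minimal sketch of the kind of input pipeline described above (the file name, feature parsing, and shuffle-buffer size are assumptions; only the batch size of 1024 comes from the question):

import tensorflow as tf

def train_input_fn():
    # Hypothetical TFRecord file holding 16x16 grayscale patches and a 0/1 label
    dataset = tf.data.TFRecordDataset(["train.tfrecord"])

    def parse(example_proto):
        parsed = tf.parse_single_example(example_proto, {
            "x": tf.FixedLenFeature([16, 16, 1], tf.float32),
            "y": tf.FixedLenFeature([1], tf.int64),
        })
        return {"x": parsed["x"]}, {"y": parsed["y"]}

    dataset = dataset.map(parse)
    dataset = dataset.shuffle(buffer_size=100000)  # substantial shuffle buffer
    dataset = dataset.batch(1024)                  # batch size from the question
    dataset = dataset.repeat()
    return dataset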

Things I've tried to improve the performance (with the outcome in parentheses):

  • Larger networks, changing the number of layers, activations, convolutional kernel sizes and strides, etc. (same convergence)
  • Dropout between dense layers (same performance as with large nets, worse performance with small nets; see the sketch after this list)
  • Other Adam hyperparameters (all eventually lead to the same convergence)
  • Other optimisers (same as above)
  • Training with a very small dataset to test convergence (loss saturates to 0)
  • Regularising the input (no effect)
  • Varying the batch size (only influences the noise in the loss and the convergence time)
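
For reference, dropout between the dense layers would typically be gated on the estimator mode so that it is only active during training. A minimal sketch using the same TF 1.x layers API (the rate of 0.4 is an arbitrary example value; reshape1 and mode refer to the cnn_model function shown further down):

dense0 = tf.layers.dense(inputs=reshape1, units=10,
                         activation=tf.nn.relu, name="dense0")
# Only applies dropout in TRAIN mode; it is a no-op at eval/predict time
dropout0 = tf.layers.dropout(inputs=dense0, rate=0.4,
                             training=(mode == tf.estimator.ModeKeys.TRAIN))
logits = tf.layers.dense(inputs=dropout0, units=1, name="logits")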

I've been stuck trying to get the performance to improve, and I think I've read every SO question I could find. Any advice would be a great help.

import tensorflow as tf


def cnn_model(features, labels, mode):
    # Downsample the 16x16 input to 8x8 using 2x2 local averaging
    features_8x8 = tf.nn.avg_pool(
            value=tf.cast(features["x"], tf.float32),
            ksize=[1, 2, 2, 1],
            strides=[1, 2, 2, 1],
            padding="SAME",
            data_format='NHWC'
            )
    conv2d_0 = tf.layers.conv2d(inputs=features_8x8,
                                filters=6,
                                kernel_size=[3, 3],
                                strides=(1, 1),
                                activation=tf.nn.relu,
                                name="conv2d_0")
    pool0 = tf.layers.max_pooling2d(
            inputs=conv2d_0,
            pool_size=(2, 2),
            strides=(2, 2),
            padding="SAME",
            data_format='channels_last'
            )
    conv2d_1 = tf.layers.conv2d(inputs=pool0,
                                filters=16,
                                kernel_size=[3, 3],
                                strides=(3, 3),
                                activation=tf.nn.relu,
                                name="conv2d_1")
    reshape1 = tf.reshape(conv2d_1, [-1, 16])
    dense0 = tf.layers.dense(inputs=reshape1,
                             units=10,
                             activation=tf.nn.relu,
                             name="dense0")
    logits = tf.layers.dense(inputs=dense0,
                             units=1,
                             name="logits")

    # ########################################################

    probabilities = tf.nn.sigmoid(logits)
    predictions = {
            "classes": tf.round(probabilities),
            "probabilities": probabilities
            }

    # ########################################################

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions)

    # ########################################################

    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.cast(labels['y'], tf.float32),
            logits=logits
            )

    loss = tf.reduce_mean(cross_entropy)

    # ########################################################

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimiser = tf.train.AdamOptimizer(learning_rate=0.001,
                                           beta1=0.9,
                                           beta2=0.999,
                                           epsilon=1e-08)
        train_op = optimiser.minimize(
                loss=loss,
                global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train_op)
    # Add evaluation metrics (for EVAL mode)
    eval_metric_ops = {
            "accuracy": tf.metrics.accuracy(
                    labels=labels["y"],
                    predictions=predictions["classes"]),
            }
    return tf.estimator.EstimatorSpec(mode=mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)
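
The question does not show how the model_fn is wired up; for reference, a minimal sketch of the usual TF 1.x Estimator driver (the model directory, step count, and eval_input_fn are assumptions):

estimator = tf.estimator.Estimator(model_fn=cnn_model,
                                   model_dir="/tmp/cnn_split_model")

# train_input_fn / eval_input_fn are assumed to yield
# ({"x": <batch of 16x16x1 floats>}, {"y": <batch of 0/1 labels>})
estimator.train(input_fn=train_input_fn, steps=100000)
print(estimator.evaluate(input_fn=eval_input_fn))   # loss + custom "accuracy"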

Solution

  • It seems that you have done a lot already. My next steps would be to visualise:

    • the dataset: can humans distinguish the classes?
    • the weights: do they converge/change during training?
    • how fine-tuned models like VGG perform on this problem

    Possibly you are dealing with a very difficult vision problem. Can we see the images or get a sample of the data? Then experienced people could try to come up with a basic model that (hopefully) works...
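
For the second point (whether the weights actually change during training), one option with the TF 1.x graph API is to add histogram summaries inside cnn_model and inspect them in TensorBoard; a minimal sketch, placed just before building the TRAIN EstimatorSpec:

# Record a histogram of every trainable variable so that weight drift
# (or the lack of it) is visible in TensorBoard across training steps
for var in tf.trainable_variables():
    tf.summary.histogram(var.op.name, var)
# Then inspect with: tensorboard --logdir <model_dir>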