TensorFlow Federated (TFF) 0.19 performs significantly worse than TFF 0.17 when running the "Building Your Own Federated Learning Algorithm" tutorial

At the very end of the "Building Your Own Federated Learning Algorithm" tutorial, it is stated that after training the model for 15 rounds we should expect a sparse_categorical_accuracy of around 0.25. Running the tutorial in Colab as is, however, gives a result between 0.09 and 0.11 across my runs. Simply changing the TF and TFF versions to 2.3.x and 0.17, respectively, gives a result around 0.25, as expected.

To reproduce, run the tutorial as is; it should use TF 2.5 and TFF 0.19. Then run the same tutorial again after changing

!pip install --quiet --upgrade tensorflow-federated

to

!pip install --quiet tensorflow==2.3.0
!pip install --quiet --upgrade tensorflow-federated==0.17.0

The TF 2.4 and TFF 0.18 combination also works fine and gives a score around 0.25. So it is only the TF 2.5 and TFF 0.19 combination that doesn't give the expected result.
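When comparing version combinations like these, it can help to confirm which packages the runtime actually has installed, since a pin may not take effect as intended. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are assumed to match those used in the pip commands above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used in the pip install commands.
for name in ("tensorflow", "tensorflow-federated"):
    print(name, installed_version(name))
```

Printing `tf.__version__` and `tff.__version__` after importing, as done later in this post, serves the same purpose once the imports succeed.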

Just to be clear, I am not saying the first setup doesn't train the model: running it for 200 rounds shows a steady improvement, with the score reaching about 0.7-0.8. I would appreciate an explanation of why this happens, or, if I did something wrong, please point it out.

Edit: To make sure the same clients were used across the different TFF versions, I used the following code.

For the training data:

sorted_client_ids = sorted(emnist_train.client_ids)
sorted_client_ids2 = sorted_client_ids[0:10]

federated_train_data = [preprocess(emnist_train.create_tf_dataset_for_client(x))
                       for x in sorted_client_ids2]

For the test data:

sorted_client_ids = sorted(emnist_test.client_ids)
sorted_client_ids2 = sorted_client_ids[0:100]

def data(client, source=emnist_test):
    return preprocess(source.create_tf_dataset_for_client(client))

central_emnist_test = (tf.data.Dataset.from_tensor_slices(
    [data(client) for client in sorted_client_ids2])).flat_map(lambda x: x)

I trained each version for 50 rounds. The results with these settings are:

for tff 0.17: loss: 1.8676 - sparse_categorical_accuracy: 0.5115

for tff 0.18: loss: 1.8503 - sparse_categorical_accuracy: 0.5160

for tff 0.19: loss: 2.2007 - sparse_categorical_accuracy: 0.1014

So my problem is this: all three versions of TFF used the same training data, the same test data, the same model initialization, and the same number of training rounds, yet the results for TFF 0.19 differ vastly from those for TFF 0.18 and 0.17, which are quite similar to each other.

Again, to clarify: TFF 0.19 also improved its accuracy, but to a significantly lesser degree.

EDIT 2: Following Zachary Charles's advice, I switched to federated SGD. For TFF 0.18 and 0.17, edit the first line (the tensorflow-federated install) to pin the corresponding version.

!pip install --quiet --upgrade tensorflow-federated
!pip install --quiet --upgrade nest-asyncio

import nest_asyncio
nest_asyncio.apply()

import collections
import attr
import functools
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

np.random.seed(0)

print(tf.__version__)
print(tff.__version__)

emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()

NUM_CLIENTS = 10
BATCH_SIZE = 20

def preprocess(dataset):
    def batch_format_fn(element):
        return(tf.reshape(element['pixels'],[-1,784]),
              tf.reshape(element['label'],[-1,1]))
    return dataset.batch(BATCH_SIZE).map(batch_format_fn)

sorted_client_ids = sorted(emnist_train.client_ids)
sorted_client_ids2 = sorted_client_ids[0:10]

federated_train_data = [preprocess(emnist_train.create_tf_dataset_for_client(x))
                       for x in sorted_client_ids2]

def create_keras_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.InputLayer(input_shape=(784,)),
        tf.keras.layers.Dense(10, kernel_initializer='zeros'),
        tf.keras.layers.Softmax(),
    ])

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=federated_train_data[0].element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    
sorted_client_ids = sorted(emnist_test.client_ids)
sorted_client_ids2 = sorted_client_ids[0:10]

def data(client, source=emnist_test):
    return preprocess(source.create_tf_dataset_for_client(client))

central_emnist_test = (tf.data.Dataset.from_tensor_slices(
    [data(client) for client in sorted_client_ids2])).flat_map(lambda x: x)

def evaluate(server_state):
    keras_model = create_keras_model()
    keras_model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]  
    )
    keras_model.set_weights(server_state)
    keras_model.evaluate(central_emnist_test)


iterative_process = tff.learning.build_federated_sgd_process(
    model_fn,
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.01))

state = iterative_process.initialize()
evaluate(state.model.trainable)

for round_num in range(50):  # avoid shadowing the built-in round()
    print(round_num)
    state, _ = iterative_process.next(state, federated_train_data)

evaluate(state.model.trainable)

The results I got are

Before training

tff 0.19: loss: 2.3026 - sparse_categorical_accuracy: 0.1207
tff 0.18: loss: 2.3026 - sparse_categorical_accuracy: 0.1010
tff 0.17: loss: 2.3026 - sparse_categorical_accuracy: 0.1207

After training

tff 0.19: loss: 2.2122 - sparse_categorical_accuracy: 0.1983
tff 0.18: loss: 2.2158 - sparse_categorical_accuracy: 0.1700
tff 0.17: loss: 2.2122 - sparse_categorical_accuracy: 0.1983
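As a sanity check on the "before training" numbers: the identical loss of 2.3026 in all three versions is what a zero-initialized model is expected to produce, which suggests the models really do start from the same point. The sketch below reproduces that value by hand. It assumes the bias is also zero (the Keras default for Dense) and uses plain Python rather than TensorFlow.

```python
import math

NUM_CLASSES = 10

# A Dense layer with a zero kernel and zero bias outputs all-zero logits,
# so the softmax assigns probability 1/NUM_CLASSES to every class.
logits = [0.0] * NUM_CLASSES
exp_logits = [math.exp(z) for z in logits]
total = sum(exp_logits)
probs = [e / total for e in exp_logits]

# Sparse categorical cross-entropy for one example is -log(p[true_label]);
# with uniform probabilities it is the same for every label.
initial_loss = -math.log(probs[0])
print(round(initial_loss, 4))  # prints 2.3026
```

By the same reasoning, an accuracy near 0.10 on a 10-class problem is roughly chance level.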

Solution

TFF 0.19 moved the provided datasets (including EMNIST, which is used in the tutorial) away from an HDF5-backed implementation to a SQL-backed implementation (commit). It's possible that this changed the ordering of the clients, which would change which clients are used for training in the tutorial.
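To illustrate why a change in ordering matters, here is a small sketch with made-up client ids (not real EMNIST ids). The two lists stand in for the same dataset served by two backends that return ids in different orders.

```python
NUM_CLIENTS = 3

# Hypothetical client ids: the same set returned in two different orders,
# standing in for two dataset backends.
ids_backend_a = ["f0003", "f0001", "f0004", "f0000", "f0002"]
ids_backend_b = ["f0000", "f0001", "f0002", "f0003", "f0004"]

# Taking the first N ids depends on whatever order the backend returns.
first_n_a = ids_backend_a[:NUM_CLIENTS]
first_n_b = ids_backend_b[:NUM_CLIENTS]
print(first_n_a == first_n_b)  # False

# Sorting first makes the selection depend only on the set of ids.
sorted_n_a = sorted(ids_backend_a)[:NUM_CLIENTS]
sorted_n_b = sorted(ids_backend_b)[:NUM_CLIENTS]
print(sorted_n_a == sorted_n_b)  # True
```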

It's worth noting that in most simulations, this should not change anything. Clients should generally be randomly sampled at each round (which is not done in the tutorial for reasons of exposition) and generally at least 100 rounds should be done (as you say).
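A rough sketch of what per-round sampling could look like is below. It uses Python's standard `random` module and hypothetical client ids so that it runs without TFF; `sample_clients` and `NUM_ROUNDS` are illustrative names, not part of the tutorial.

```python
import random

NUM_CLIENTS = 10
NUM_ROUNDS = 100

# Hypothetical ids; in the tutorial these would come from emnist_train.client_ids.
client_ids = sorted(f"client_{i:04d}" for i in range(500))

rng = random.Random(1729)  # fixed seed so the schedule is reproducible


def sample_clients(rng, client_ids, num_clients):
    """Draw a fresh set of distinct clients for one round."""
    return rng.sample(client_ids, num_clients)


schedule = [sample_clients(rng, client_ids, NUM_CLIENTS)
            for _ in range(NUM_ROUNDS)]

# Each round would then build its client datasets from schedule[round_num]
# and pass them to iterative_process.next(state, round_data).
print(len(schedule), len(schedule[0]))  # prints 100 10
```

With a different subset of clients each round, the result depends far less on which particular clients happen to come first.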

I'll update the tutorial to guarantee reproducibility by sorting the client ids, and then selecting them in order.

For anyone who's interested, a better practice would be to a) sort the client ids, and then b) sample using something like np.random.RandomState, as in the following snippet:

emnist_train, _ = tff.simulation.datasets.emnist.load_data()
random_state = np.random.RandomState(seed=1729)
sorted_client_ids = sorted(emnist_train.client_ids)
sampled_client_ids = random_state.choice(sorted_client_ids, size=NUM_CLIENTS)
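One detail worth knowing: `RandomState.choice` samples with replacement by default, so the same client id can be drawn more than once. If distinct clients are wanted, a variant like the following may be preferable. The ids here are hypothetical stand-ins so the sketch runs without downloading EMNIST.

```python
import numpy as np

NUM_CLIENTS = 10
# Hypothetical, already sorted ids standing in for sorted(emnist_train.client_ids).
sorted_client_ids = [f"client_{i:04d}" for i in range(50)]

random_state = np.random.RandomState(seed=1729)
# replace=False guarantees NUM_CLIENTS distinct clients; the default
# (replace=True) may return the same id more than once.
sampled_client_ids = random_state.choice(
    sorted_client_ids, size=NUM_CLIENTS, replace=False)

print(len(set(sampled_client_ids)))  # prints 10
```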