Tags: python, tensorflow, keras, tensorflow-federated, federated-learning

How to load the Fashion MNIST dataset in TensorFlow Federated?


I am working on a project with TensorFlow Federated. I have managed to use the simulation libraries provided by TensorFlow Federated to load, train, and test some datasets.

For example, I load the EMNIST dataset:

emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()

The datasets returned by load_data() are instances of tff.simulation.ClientData, an interface that lets me iterate over client IDs and select subsets of the data for simulations.

len(emnist_train.client_ids)

3383


emnist_train.element_type_structure


OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])


example_dataset = emnist_train.create_tf_dataset_for_client(
    emnist_train.client_ids[0])

I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:

fashion_train, fashion_test = tf.keras.datasets.fashion_mnist.load_data()

but I get this error:

AttributeError: 'tuple' object has no attribute 'element_spec'

because Keras returns a tuple of NumPy arrays instead of a tff.simulation.ClientData like before:

def tff_model_fn() -> tff.learning.Model:
    return tff.learning.from_keras_model(
        keras_model=factory.retrieve_model(True),
        input_spec=fashion_test.element_spec,
        loss=loss_builder(),
        metrics=metrics_builder())

iterative_process = tff.learning.build_federated_averaging_process(
    tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()

To sum up:

  1. Is there any way to create a tff.simulation.ClientData from the tuple of NumPy arrays returned by Keras?

  2. Another solution that comes to mind is to use tff.simulation.HDF5ClientData and manually load the appropriate files in HDF5 format (train.h5, test.h5) in order to get the tff.simulation.ClientData, but my problem is that I can't find a URL for fashion_mnist in HDF5 format, i.e. something like this for both train and test:

          # Snippet adapted from TFF's EMNIST loader; `cache_dir` is a
          # parameter of the enclosing function.
          import os

          import tensorflow as tf
          from tensorflow_federated.python.simulation import hdf5_client_data

          fileprefix = 'fed_emnist_digitsonly'
          sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
          filename = fileprefix + '.tar.bz2'
          path = tf.keras.utils.get_file(
              filename,
              origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
              file_hash=sha256,
              hash_algorithm='sha256',
              extract=True,
              archive_format='tar',
              cache_dir=cache_dir)
    
          dir_path = os.path.dirname(path)
          train_client_data = hdf5_client_data.HDF5ClientData(
              os.path.join(dir_path, fileprefix + '_train.h5'))
          test_client_data = hdf5_client_data.HDF5ClientData(
              os.path.join(dir_path, fileprefix + '_test.h5'))
    
          return train_client_data, test_client_data
    

My final goal is to make the fashion_mnist dataset work with TensorFlow Federated.


Solution

  • You're on the right track. To recap: the datasets returned by the tff.simulation.datasets APIs are tff.simulation.ClientData objects, while the object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of NumPy arrays.

    So what is needed is to implement a tff.simulation.ClientData that wraps the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Previous questions on implementing ClientData objects cover the details.
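    As a rough sketch of the wrapping step: the NumPy arrays can be grouped into per-client structures that mirror the EMNIST element layout (`pixels`, `label`). The helper below and the round-robin split are illustrative assumptions, not a fixed API; the commented usage assumes an older TFF release where `tff.simulation.FromTensorSlicesClientData` is available (newer releases expose similar constructors under `tff.simulation.datasets`).

    ```python
    import collections

    import numpy as np


    def build_client_dict(images, labels, num_clients=10):
        """Group (images, labels) arrays into per-client OrderedDicts that
        mirror the EMNIST element structure: {'pixels': ..., 'label': ...}.

        Uses a naive round-robin split; any partitioning scheme can be
        substituted to produce the per-client index sets.
        """
        client_dict = collections.OrderedDict()
        for i in range(num_clients):
            idx = np.arange(i, len(images), num_clients)
            client_dict[f'client_{i}'] = collections.OrderedDict(
                pixels=images[idx].astype(np.float32) / 255.0,
                label=labels[idx].astype(np.int32))
        return client_dict


    # Hypothetical usage (downloads Fashion MNIST, requires TFF):
    # (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
    # train_client_data = tff.simulation.FromTensorSlicesClientData(
    #     build_client_dict(x_train, y_train))
    ```

    Once wrapped this way, `create_tf_dataset_for_client` and `element_type_structure` behave like they do for the built-in EMNIST dataset.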

    This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that could be used for partitioning. Researchers have come up with several ways to synthetically partition the data, e.g. randomly sampling a few labels for each participant, but the choice has a large effect on model training, so it is worth investing some thought here.
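    One common synthetic scheme, sketched below under the assumption that the shard counts divide evenly, is to give each simulated client examples from only a couple of label classes, which yields a non-IID federation. The function name and parameters are illustrative, not from any TFF API.

    ```python
    import numpy as np


    def partition_by_label(labels, num_clients, labels_per_client=2, seed=0):
        """Assign example indices to clients so each client sees at most
        `labels_per_client` classes -- a simple non-IID synthetic split.

        Assumes num_clients * labels_per_client is a multiple of the
        number of classes, so shards divide evenly.
        """
        rng = np.random.default_rng(seed)
        classes = np.unique(labels)
        shards_per_class = num_clients * labels_per_client // len(classes)
        # Split each class's (shuffled) indices into equal shards.
        shards = []
        for c in classes:
            idx = rng.permutation(np.flatnonzero(labels == c))
            shards.extend(np.array_split(idx, shards_per_class))
        rng.shuffle(shards)
        # Each client receives `labels_per_client` shards.
        return {
            f'client_{i}': np.concatenate(
                shards[i * labels_per_client:(i + 1) * labels_per_client])
            for i in range(num_clients)
        }
    ```

    The resulting mapping from client IDs to index arrays can then be used to slice the Fashion MNIST arrays into per-client data before wrapping them in a ClientData implementation.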