Tags: tensorflow-datasets, tensorflow-federated

Loading multiple CSV files (silos) to compose Tensorflow Federated dataset


I am working with pre-processed data that has already been siloed into separate CSV files, each representing local data for federated learning.

To correctly implement federated learning with these multiple CSVs in TensorFlow Federated, I am trying to reproduce the same approach with a toy example using the iris dataset. However, when calling the method tff.simulation.datasets.TestClientData, I get the error:

TypeError: can't pickle _thread.RLock objects

The current code is as follows. First, load the three iris dataset CSV files (50 samples each) into a dictionary, from the filenames iris1.csv, iris2.csv, and iris3.csv:

    import collections

    import pandas as pd
    import tensorflow as tf
    import tensorflow_federated as tff

    silos = {}
    for silo in silos_files:  # ["iris1.csv", "iris2.csv", "iris3.csv"]
        silo_name = silo.replace(".csv", "")
        silos[silo_name] = pd.read_csv(silos_path + silo)
        # encode the string labels as integers
        silos[silo_name]["variety"].replace(
            {"Setosa": 0, "Versicolor": 1, "Virginica": 2}, inplace=True)

Next, create a new dict holding one tf.data.Dataset per silo:

    silos_tf = collections.OrderedDict()
    for key, silo in silos.items():
        silos_tf[key] = tf.data.Dataset.from_tensor_slices(
            (silo.drop(columns=["variety"]).values, silo["variety"].values))

Finally, try to convert the TensorFlow Dataset into a TensorFlow Federated dataset:

    tff_dataset = tff.simulation.datasets.TestClientData(
        silos_tf
    )

That raises the error:

TypeError                                 Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
      1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2     silos_tf
      3 )

/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
     59     """
     60     py_typecheck.check_type(tensor_slices_dict, dict)
---> 61     tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
     62     structures = list(tensor_slices_dict.values())
     63     example_structure = structures[0]

...

/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

TypeError: can't pickle _thread.RLock objects
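The last traceback frame points at the root cause: TestClientData deep-copies its input, copy.deepcopy falls back to pickling via __reduce_ex__, and lock objects (which tf.data.Dataset instances apparently hold internally) cannot be pickled. A minimal stdlib-only illustration of the same failure, with no TensorFlow involved:

```python
import copy
import threading

# deepcopy of a lock falls back to pickling, and lock objects are not
# picklable -- the same TypeError the TFF traceback above reports.
lock = threading.RLock()
try:
    copy.deepcopy(lock)
except TypeError as err:
    print(type(err).__name__, err)
```

The exact message wording varies across Python versions, but the exception type is the same TypeError seen in the traceback.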

I also tried using a plain Python dict instead of an OrderedDict, but the error is the same. For this experiment, I am using Google Colab with this notebook as reference, running TensorFlow 2.8.0 and TensorFlow Federated 0.20.0. I also used these previous questions as references:

Is there a reasonable way to create tff clients datat sets?

'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly

I am not sure whether this approach generalizes beyond the toy example, so I would be thankful for any suggestion on how to bring already-siloed data into TFF tests.


Solution

  • I searched public code on GitHub that uses the class tff.simulation.datasets.TestClientData and found the following implementation (source here):

    def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
        ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
    
        """Transform dataset to be fed to fedjax
        :param clientsData: dataset for each client
        :param clientsDataLabels:
        :param ds_info: dataset information
        :param train: True if processing train split
        :return: dataset for each client cast into TestClientData
        """
        num_clients = ds_info['num_clients']
    
        client_data = collections.OrderedDict()
    
        for i in range(num_clients if is_train else 1):
            client_data[str(i)] = collections.OrderedDict(
                x=clientsData[i],
                y=clientsDataLabels[i])
    
        return tff.simulation.datasets.TestClientData(client_data)
    

    From this snippet I understood that tff.simulation.datasets.TestClientData requires as argument an OrderedDict of NumPy arrays rather than a dict of tf.data.Datasets (as in my previous implementation), so I changed the code to the following:

    silos_tf = collections.OrderedDict()
    for key, silo in silos.items():
        silos_tf[key] = collections.OrderedDict(
            x=silo.drop(columns=["variety"]).values,
            y=silo["variety"].values)

    Followed by:

    tff_dataset = tff.simulation.datasets.TestClientData(
        silos_tf
    )
    

    That runs correctly, with the following output:

    >>> tff_dataset.client_ids
    ['iris3', 'iris1', 'iris2']
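
    For completeness, the conversion can be sketched end-to-end without TFF installed, since everything up to the final TestClientData call is plain pandas/NumPy. The column names and the synthetic frames below are made up to stand in for the real CSV files:

```python
import collections

import numpy as np
import pandas as pd

# stand-in for one CSV silo; in the real code each frame comes from
# pd.read_csv("iris1.csv") etc., with "variety" already label-encoded
def fake_silo(n=50, seed=0):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "sepal.length": rng.uniform(4, 8, n),
        "sepal.width": rng.uniform(2, 5, n),
        "petal.length": rng.uniform(1, 7, n),
        "petal.width": rng.uniform(0, 3, n),
        "variety": rng.integers(0, 3, n),
    })

silos = {f"iris{i}": fake_silo(seed=i) for i in (1, 2, 3)}

# one OrderedDict of NumPy arrays per client, the structure
# TestClientData accepts without the deepcopy/pickle failure
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = collections.OrderedDict(
        x=silo.drop(columns=["variety"]).values,
        y=silo["variety"].values)

# with tensorflow_federated installed, the last steps would be:
# tff_dataset = tff.simulation.datasets.TestClientData(silos_tf)
# per_client = tff_dataset.create_tf_dataset_for_client("iris1")
print(list(silos_tf), silos_tf["iris1"]["x"].shape)
```

    Each client then exposes 50 examples with 4 features, and create_tf_dataset_for_client rebuilds a per-client tf.data.Dataset on demand.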