I am working on pre-processed data that were already siloed into separated csv files to represent separated local data for federated learning.
To correct implement the federated learning with these multiple CSVs on TensorFlow Federated, I am just trying to reproduce the same approach with a toy example in the iris dataset. However, when trying to use the method tff.simulation.datasets.TestClientData
, I am getting the error:
TypeError: can't pickle _thread.RLock objects
The current code is as follows, first, load the three iris dataset CSV files (50 samples on each) into a dictionary from the filenames iris1.csv, iris2.csv, and iris3.csv:
silos = {}
for silo in silos_files:
silo_name = silo.replace(".csv", "")
silos[silo_name] = pd.read_csv(silos_path + silo)
silos[silo_name]["variety"].replace({"Setosa" : 0, "Versicolor" : 1, "Virginica" : 2}, inplace=True)
Creating a new dict with tensors:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
silos_tf[key] = tf.data.Dataset.from_tensor_slices((silo.drop(columns=["variety"]).values, silo["variety"].values))
Finally, trying to converting the Tensorflow Dataset into a Tensorflow Federated Dataset:
tff_dataset = tff.simulation.datasets.TestClientData(
silos_tf
)
That raises the error:
TypeError Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2 silos_tf
3 )
/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
59 """
60 py_typecheck.check_type(tensor_slices_dict, dict)
---> 61 tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
62 structures = list(tensor_slices_dict.values())
63 example_structure = structures[0]
...
/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
167 reductor = getattr(x, "__reduce_ex__", None)
168 if reductor:
--> 169 rv = reductor(4)
170 else:
171 reductor = getattr(x, "__reduce__", None)
TypeError: can't pickle _thread.RLock objects
I also tried to use Python dictionary instead of OrderedDict but the error is the same. For this experiment, I am using Google Colab with this notebook as reference running with TensorFlow 2.8.0 and TensorFlow Federated version 0.20.0. I also used these previous questions as references:
Is there a reasonable way to create tff clients datat sets?
I am not sure if this is a good way that derives for a case beyond the toy example, please, if any suggestion on how to bring already siloed data for TFF tests, I am thankful.
I did some search of public code in github using class tff.simulation.datasets.TestClientData
, then I found the following implementation (source here):
def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
"""Transform dataset to be fed to fedjax
:param clientsData: dataset for each client
:param clientsDataLabels:
:param ds_info: dataset information
:param train: True if processing train split
:return: dataset for each client cast into TestClientData
"""
num_clients = ds_info['num_clients']
client_data = collections.OrderedDict()
for i in range(num_clients if is_train else 1):
client_data[str(i)] = collections.OrderedDict(
x=clientsData[i],
y=clientsDataLabels[i])
return tff.simulation.datasets.TestClientData(client_data)
I understood from this snippet that the tff.simulation.datasets.TestClientData
class requires as argument an OrderedDict composed by numpy arrays instead of a dict of tensors (as my previous implementation), now I changed the code for the following:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
silos_tf[key] = collections.OrderedDict(
x=silo.drop(columns=["variety"]).values,
y=silo["variety"].values)
Followed by:
tff_dataset = tff.simulation.datasets.TestClientData(
silos_tf
)
That correctly runs as the following output:
>>> tff_dataset.client_ids
['iris3', 'iris1', 'iris2']