I am building a federated learning model using my own dataset. I aim to build a multi-class classification model. The data are stored in 8 separate CSV files.
I followed the instructions in this post, as shown in the code below.
import tensorflow as tf
import tensorflow_federated as tff

dataset_paths = {
  'client_0': '/content/ds1.csv',
  'client_1': '/content/ds2.csv',
  'client_2': '/content/ds3.csv',
  'client_3': '/content/ds4.csv',
  'client_4': '/content/ds5.csv',
}

def create_tf_dataset_for_client_fn(id):
  path = dataset_paths.get(id)
  if path is None:
    raise ValueError(f'No dataset for client {id}')
  return tf.data.TextLineDataset(path)

source = tff.simulation.datasets.ClientData.from_clients_and_fn(
    dataset_paths.keys(), create_tf_dataset_for_client_fn)
But it gave me this error:
AttributeError: type object 'ClientData' has no attribute 'from_clients_and_fn'
I was reading this documentation and found that the .datasets methods would work, so I replaced .from_clients_and_fn with them. The error disappeared, but I don't know whether this is right or what to do next.
My questions are:
Thanks in advance.
In this setup it may be useful to consider tff.simulation.datasets.FilePerUserClientData and tf.data.experimental.CsvDataset.
This might look like the following (it generates some test CSV data for the sake of the example; the dataset you're working with likely has a different shape):
import tensorflow as tf
import tensorflow_federated as tff

dataset_paths = {
  'client_0': '/content/ds1.csv',
  'client_1': '/content/ds2.csv',
  'client_2': '/content/ds3.csv',
  'client_3': '/content/ds4.csv',
  'client_4': '/content/ds5.csv',
}

# Create some test data for the sake of the example,
# normally we wouldn't do this.
for i, (id, path) in enumerate(dataset_paths.items()):
  with open(path, 'w') as f:
    for _ in range(i):
      # One row per iteration: a string, a float, and an int column.
      f.write(f'test,0.0,{i}\n')

# Default values used to fill in any missing CSV cell; they also set
# the dtype of each column (string, float32, int32).
record_defaults = ['', 0.0, 0]

def create_tf_dataset_for_client_fn(dataset_path):
  return tf.data.experimental.CsvDataset(
      dataset_path, record_defaults=record_defaults)

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)

print(source.client_ids)
>>> ['client_0', 'client_1', 'client_2', 'client_3', 'client_4']

for x in source.create_tf_dataset_for_client('client_3'):
  print(x)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
It may be possible to concatenate all the data into a single CSV, but each record would still need some identifier indicating which row belongs to which client. Mixing all the rows together without any kind of per-client mapping would be akin to standard centralized training, not federated learning.
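As a rough illustration of keeping a per-client identifier when concatenating, the sketch below merges several per-client CSV files into one file and prefixes every row with its client ID. It uses only the Python standard library; the helper name combine_client_csvs, the temporary demo files, and the assumption that the CSVs have no header row are all hypothetical and not taken from the question.

```python
import csv
import os
import tempfile


def combine_client_csvs(paths_by_client, combined_path):
    """Write every client's rows to one CSV, prefixing each row with its client ID."""
    with open(combined_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for client_id, path in paths_by_client.items():
            with open(path, newline='') as f:
                for row in csv.reader(f):
                    # The first column records which client the row came from.
                    writer.writerow([client_id] + row)


# Build three tiny header-less demo files (hypothetical data).
tmp_dir = tempfile.mkdtemp()
demo_paths = {}
for i in range(3):
    path = os.path.join(tmp_dir, f'ds{i}.csv')
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows([['test', '0.0', str(i)]] * (i + 1))
    demo_paths[f'client_{i}'] = path

combined_path = os.path.join(tmp_dir, 'combined.csv')
combine_client_csvs(demo_paths, combined_path)

# Read the combined file back to inspect the result.
with open(combined_path, newline='') as f:
    combined_rows = list(csv.reader(f))
```

Without the extra first column, the combined file would carry no information about which client produced which row, so it could not be partitioned again for federated training.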
Once a CSV has all the rows, and perhaps a column with a client_id value, one could presumably use tf.data.Dataset.filter() to only yield the rows belonging to a particular client. This probably won't be particularly efficient though, as it would iterate over the entire global dataset for each client, rather than only that client's examples.
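The cost difference can be seen in a small plain-Python sketch, which stands in for the tf.data pipeline and is not TFF code. It counts how many rows are read when each client's rows are obtained by filtering the whole file, compared with grouping the rows by client in a single pass. The in-memory CSV, the function names, and the assumption that client_id is the first column are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical combined CSV; the first column is the client ID.
COMBINED_CSV = (
    'client_0,test,0.0,0\n'
    'client_1,test,0.0,1\n'
    'client_0,test,0.0,0\n'
    'client_2,test,0.0,2\n'
    'client_1,test,0.0,1\n'
    'client_2,test,0.0,2\n'
)


def rows_for_client_by_filtering(client_id):
    """Scan the whole file, keeping one client's rows. Returns (rows, rows_scanned)."""
    scanned = 0
    kept = []
    for row in csv.reader(io.StringIO(COMBINED_CSV)):
        scanned += 1
        if row[0] == client_id:
            kept.append(row[1:])
    return kept, scanned


def group_rows_once():
    """Read the file a single time and bucket rows by client. Returns (groups, rows_scanned)."""
    grouped = defaultdict(list)
    scanned = 0
    for row in csv.reader(io.StringIO(COMBINED_CSV)):
        scanned += 1
        grouped[row[0]].append(row[1:])
    return dict(grouped), scanned


client_ids = ['client_0', 'client_1', 'client_2']

# Filtering reads every row once per client.
filter_scans = sum(rows_for_client_by_filtering(c)[1] for c in client_ids)

# Grouping reads every row exactly once in total.
grouped, group_scans = group_rows_once()
```

With per-client filtering the number of rows read grows with the number of clients times the total number of rows, whereas a one-time split (for example into one file per client) reads each row once.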