I am using a human activity recognition (HAR) dataset with 6 classes in a federated learning (FL) setting. I create non-IID partitions in three ways: (1) each of the 6 classes is assigned to a different worker (6 workers), (2) two classes per worker (3 workers), and (3) three classes per worker (2 workers).
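For reference, here is roughly how the partitioning works (a minimal NumPy sketch; `partition_by_label` and the toy labels are just for illustration, my actual code differs):

```python
import numpy as np

def partition_by_label(y, classes_per_worker, seed=0):
    """Group example indices by label, then hand whole label groups
    to workers so each worker holds `classes_per_worker` classes."""
    labels = np.unique(y)                      # e.g. the 6 HAR classes
    rng = np.random.default_rng(seed)
    rng.shuffle(labels)
    n_workers = len(labels) // classes_per_worker
    shards = np.array_split(labels, n_workers)
    return [np.flatnonzero(np.isin(y, shard)) for shard in shards]

# Scenario (1): 6 workers x 1 class, (2): 3 workers x 2 classes,
# (3): 2 workers x 3 classes.
y = np.random.default_rng(1).integers(0, 6, size=600)  # toy labels
for k in (1, 2, 3):
    parts = partition_by_label(y, classes_per_worker=k)
    print(len(parts), "workers,", [len(p) for p in parts], "examples each")
```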
When I run the FL process, the validation accuracy comes out as scenario (3) > (2) > (1). I expected all scenarios to reach roughly the same validation accuracy. Each scenario uses the same hyperparameter settings, including batch size, shuffle buffer, and model configuration.
Is this common in FL with non-IID datasets, or is there a problem with my results?
The scenario where each worker has only one label (and all examples of that label) can be considered "pathologically bad" non-IID for Federated Averaging.
In this scenario, it's possible that each worker learns to predict only the label it holds. The model does not need to discriminate on any features: if a worker only has class 1, it can always predict class 1 and obtain 100% local accuracy. Each round, when all of the model updates are averaged, the global model is back to one that predicts every class with probability 1/6.
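A toy sketch makes this concrete (assuming a linear model reduced to its bias term, which is all the model needs when features are ignored): six clients each converge to a bias vector that always predicts their own class, and the FedAvg mean of those biases is a uniform predictor.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_classes = 6
# Each worker's "converged" model: a bias vector with a large logit on
# its own single class, so it predicts that class regardless of input.
client_biases = [np.eye(n_classes)[c] * 10.0 for c in range(n_classes)]

global_bias = np.mean(client_biases, axis=0)  # FedAvg: plain average
print(softmax(global_bias))                   # ~[1/6, 1/6, ..., 1/6]
```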
The closer each worker's distribution of examples is to the global distribution (or to each other's, i.e. the more IID the client datasets are), the closer its local training will produce an update that points in the same direction as the averaged update, leading to better training results.
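One way to see this quantitatively is to compare each client's parameter delta to the averaged delta after a round. A rough diagnostic sketch (the deltas here are random placeholders standing in for real flattened model updates):

```python
import numpy as np

def update_alignment(client_updates):
    """Cosine similarity of each client's flattened parameter delta
    with the FedAvg mean delta. Values near 1 mean clients push the
    global model in the same direction (IID-like); values near 0 or
    negative signal strong non-IID drift."""
    U = np.stack([u.ravel() for u in client_updates])
    mean = U.mean(axis=0)
    return U @ mean / (np.linalg.norm(U, axis=1) * np.linalg.norm(mean) + 1e-12)

# Placeholder deltas from one round, one array per client:
rng = np.random.default_rng(0)
deltas = [rng.normal(size=100) for _ in range(6)]
print(update_alignment(deltas))
```

In your experiments I would expect this alignment to be lowest in scenario (1) and highest in scenario (3), mirroring the accuracy ordering you observed.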