
Multi-GPU TFF simulation errors "Detected dataset reduce op in multi-GPU TFF simulation"


I ran my code for an emotion detection model using TensorFlow Federated simulation. My code works perfectly fine using CPUs only. However, I received this error when trying to run TFF with GPUs.

ValueError: Detected dataset reduce op in multi-GPU TFF simulation: `use_experimental_simulation_loop=True` for `tff.learning`; or use `for ... in iter(dataset)` for your own dataset iteration.Reduce op will be functional after b/159180073.

What is this error about and how can I fix it? I searched in many places but found no answer.

Here is the call stack if it helps. It is very long so I pasted it into this link: https://pastebin.com/b1R93gf1

EDIT:

Here is the code containing iterative_process

def startTraining(output_file):
    
    iterative_process = tff.learning.build_federated_averaging_process(
        model_fn,
        client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.01),
        server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
        use_experimental_simulation_loop=True
    )
    
    flstate = iterative_process.initialize()
    evaluation = tff.learning.build_federated_evaluation(model_fn)
    
    output_file.write(
        'round,available_users,loss,sparse_categorical_accuracy,val_loss,val_sparse_categorical_accuracy,test_loss,test_sparse_categorical_accuracy\n')
    curr_round_result = [0,0,100,0,100,0]
    min_val_loss = 100
    for round in range(1,ROUND_COUNT + 1):
        available_users = fetch_available_users_and_increase_time(ROUND_DURATION_AVERAGE + random.randint(-ROUND_DURATION_VARIATION, ROUND_DURATION_VARIATION + 1))
        if(len(available_users) == 0):
            write_to_file(curr_round_result)
            continue
        train_data = make_federated_data(available_users, 'train')
        flstate, metrics = iterative_process.next(flstate, train_data)
        val_data = make_federated_data(available_users, 'val')
        val_metrics = evaluation(flstate.model, val_data)
        
        curr_round_result[0] = round
        curr_round_result[1] = len(available_users)
        curr_round_result[2] = metrics['train']['loss']
        curr_round_result[3] = metrics['train']['sparse_categorical_accuracy']
        curr_round_result[4] = val_metrics['loss']
        curr_round_result[5] = val_metrics['sparse_categorical_accuracy']
        write_to_file(curr_round_result)

Here is the code for make_federated_data

def make_federated_data(users, dataset_type):
    offset = 0
    if(dataset_type == 'val'):
        offset = train_size
    elif(dataset_type == 'test'):
        offset = train_size + val_size
    
    global LOADED_USER
    for id in users:
        if(id + offset not in LOADED_USER):
            LOADED_USER[id + offset] = getDatasetFromFilePath(filepaths[id + offset])

    return [
        LOADED_USER[id + offset]
        for id in users
    ]
        

Solution

  • TFF does support multi-GPU simulation, and as the error message says, one of two things is happening:

    1. The code is using tff.learning but leaves the use_experimental_simulation_loop argument at its default value of False. With multiple GPUs, this must be set to True when using APIs such as tff.learning.build_federated_averaging_process. For example, calling with:
    training_process = tff.learning.build_federated_averaging_process(
      ..., use_experimental_simulation_loop=True)
    
    2. The code contains a custom tf.data.Dataset.reduce(...) call somewhere. This must be replaced with Python code that iterates over the dataset. For example:
    result = dataset.reduce(initial_state=0, reduce_func=lambda s, x: s + x)
    

    becomes

    s = 0
    for x in iter(dataset):
      s += x
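To make the transformation above concrete, here is a minimal runnable sketch. It uses a plain Python list and functools.reduce to stand in for a tf.data.Dataset and its reduce(...) method (TFF and TensorFlow are not required to follow the idea); the point is only that both styles compute the same result, while the explicit loop avoids the dataset-reduce op that trips up multi-GPU simulation:

```python
from functools import reduce

# A plain Python list stands in for the tf.data.Dataset in this sketch.
dataset = [1, 2, 3, 4]

# Reduce-style accumulation (mirrors tf.data.Dataset.reduce with
# initial_state=0 and reduce_func=lambda s, x: s + x).
total_reduce = reduce(lambda s, x: s + x, dataset, 0)

# Loop-style accumulation, as the error message suggests:
# explicit `for ... in iter(dataset)` iteration.
total_loop = 0
for x in iter(dataset):
    total_loop += x

print(total_reduce, total_loop)  # both are 10
```

In real TFF code the same rewrite applies inside the tf.function that consumes the client dataset: keep the accumulator as local state and update it in a `for ... in iter(dataset)` loop instead of calling dataset.reduce.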