Tags: python, nlp, sentence-transformers

sentence transformer use of evaluator


I came across this script (the second link on this page) and this explanation. I am using all-mpnet-base-v2 (link) with my own custom data.

I am having hard time understanding use of

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    dev_samples, name='sts-dev')

The documentation says:

evaluator – An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc.

But in this case, as we are fine-tuning on our own examples, train_dataloader has train_samples, which contains our sentences and scores.

Q1. How is train_samples different from dev_samples?

Q2a: If the model prints performance against dev_samples, how does that help "to determine the best model that is saved to disc"?

Q2b: Are we required to run dev_samples against the model saved on the disc and then compare scores?

Q3. If my goal is to take a single model and then fine tune it, is it okay to skip parameters evaluator and evaluation_steps?

Q4. How do I determine the total number of steps for the model? Do I need to set evaluation_steps?


Updated

I followed the answer provided by Kyle and have the follow-up questions below.

In the fit method I used the evaluator, and the data below was written to a file (screenshot of the evaluator's output omitted).

Q5. Which metric is used to select the best epoch? Is it cosine_pearson?

Q6: Why are the steps -1 in the output above?

Q7a: How do I find the number of steps based on the size of my data, batch size, etc.?

Currently I have kept evaluation_steps at 1000, but I am not sure if that is too much. I am running for 10 epochs, I have 2,509 examples in the training data, and the batch size is 64.

Q7b: Are my steps going to be 2509/64? If so, 1000 seems too high.


Solution

  • Question 1

    How is train_samples different from dev_samples in the context of the EmbeddingSimilarityEvaluator?

    One needs to have a "held-out" split of data to be used for evaluation during training to avoid over-fitting. This "held-out" set is commonly referred to as the "development set", as it is the set of data that is used during development of the model/system.

    A pedagogical analogy can be drawn between a traditional education curriculum and training deep learning models: if one were to give students all the questions for a given topic, and then use the same subset of questions for evaluation, then eventually (most) students will learn to memorise the set of answers they repeatedly see while practicing, instead of learning the procedures to solve the questions in general.

    So if you are using your own custom data, make sure that a subset of that data is allocated to dev_samples in addition to train_samples and test_samples. Alternatively, if your own data is scarce, you can use the original training data to supplement your own training, development and test sets. The "test set" is the one that is only used after training has completed to determine the final performance of the model (i.e. all samples in the test set (ideally) haven't been seen before).
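
    For illustration, here is a minimal sketch of such a split (the variable names, toy sentence pairs and the 80/10/10 ratio are my own assumptions, not part of the original script):

    import random

    from sentence_transformers import InputExample
    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

    # Hypothetical custom data: (sentence1, sentence2, similarity score in [0, 1]) tuples.
    my_data = [
        ("A man is eating food.", "A man eats something.", 0.9),
        ("A woman plays guitar.", "The stock market fell today.", 0.0),
        # ... the rest of your pairs ...
    ]

    random.shuffle(my_data)
    n = len(my_data)

    def to_examples(rows):
        return [InputExample(texts=[s1, s2], label=score) for s1, s2, score in rows]

    # Assumed 80/10/10 split into train / dev / test.
    train_samples = to_examples(my_data[: int(0.8 * n)])
    dev_samples = to_examples(my_data[int(0.8 * n): int(0.9 * n)])
    test_samples = to_examples(my_data[int(0.9 * n):])

    # Only the dev split feeds the evaluator used during training.
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')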

    Question 2

    How is the model going to determine the best model that is saved to disc? Are we required to run dev_samples against the model saved on the disc and then compare scores?

    The previous answer alludes to how this works, but in brief: once the evaluator has been instantiated, it measures the correlation of the model's similarity scores against the gold labels and returns a score (which one depends on what main_similarity was set to). If the embeddings produced for the development set yield a higher correlation with their gold labels, and therefore a higher score overall, then this "better" model is saved to disk. Hence, there is no need for you to "run dev_samples against the model saved on the disc and then compare scores"; this happens automatically, provided everything has been set up appropriately.
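
    As a sketch of how this looks in practice (the output path and hyper-parameters below are placeholders, not values from the original script, and train_samples/evaluator come from the snippet under Question 1 or from your own script), passing the evaluator together with an output_path to model.fit() is all that is needed for the best-scoring checkpoint to be written to disk:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, losses

    model = SentenceTransformer('all-mpnet-base-v2')
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
    train_loss = losses.CosineSimilarityLoss(model=model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,                       # scores the model on dev_samples
        epochs=10,
        evaluation_steps=1000,
        output_path='output/my-finetuned-model',   # placeholder path
        save_best_model=True,                      # keep the checkpoint with the best evaluator score
    )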

    Question 3

    If my goal is to take a single model and then fine tune it, is it okay to skip parameters evaluator and evaluation_steps?

    Based on the above answers, you can see why you should not skip the evaluator and evaluation_steps: the evaluator is an integral part of "fine-tuning" (i.e. training) the model, since it is what determines which checkpoint gets saved as the best model.

    Question 4

    How do I determine the total number of steps for the model? Do I need to set evaluation_steps?

    The evaluation_steps parameter sets the number of training steps that must occur before the model is evaluated using the evaluator. If the authors have set this to 1000, then leave it as is unless you notice problems with training. Alternatively, experiment with either increasing or decreasing it and select a value that works best for training.

    Follow-Up Questions

    Question 5

    Which metric is used to select the best epoch? Is it cosine_pearson?

    By default, the maximum of the Cosine Spearman, Manhattan Spearman, Euclidean Spearman and Dot Product Spearman is used.
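
    If you want to verify this on your own run, the evaluator writes its results as a CSV into the eval/ sub-folder of the output path (the path and column names below assume name='sts-dev' and the usual CSV layout; adjust them to whatever your version actually produced), and you can reproduce the default selection logic with a few lines of pandas:

    import pandas as pd

    # Assumed location/filename of the evaluator's CSV output.
    results = pd.read_csv('output/my-finetuned-model/eval/similarity_evaluation_sts-dev_results.csv')

    spearman_cols = ['cosine_spearman', 'manhattan_spearman', 'euclidean_spearman', 'dot_spearman']

    # Default behaviour: the score for each evaluation is the max of the four Spearman correlations.
    results['score'] = results[spearman_cols].max(axis=1)
    best = results.loc[results['score'].idxmax()]
    print(best[['epoch', 'steps', 'score']])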

    Question 6

    Why are steps -1 in the output?

    The -1 lets the user know that the evaluator was called after all training steps occurred for a particular epoch.

    If the steps_per_epoch was not set when calling the model.fit(), it defaults to None which sets the number of steps_per_epoch to the size of the train_dataloader which is passed to train_objectives when model.fit() is initially called, i.e.:

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              ...)
    

    In your case, train_samples contains 2,509 examples and train_batch_size is 64, so the size of train_dataloader, and therefore steps_per_epoch, will be roughly 2,509 / 64 ≈ 39 (a standard DataLoader with the default drop_last=False keeps the final partial batch, giving 40 steps; dropping it gives 39).

    If steps_per_epoch is less than evaluation_steps, then the number of training steps won't reach or exceed evaluation_steps, and so the additional calls to _eval_during_training on line 737 won't occur. This isn't a problem, as the evaluation is forced to run at the end of each epoch anyway based on line 747.
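
    A quick way to double-check the number of training steps per epoch (assuming a standard PyTorch DataLoader built over train_samples as above):

    import math

    num_train_examples = 2509
    train_batch_size = 64

    print(math.ceil(num_train_examples / train_batch_size))   # 40 with the default drop_last=False
    print(num_train_examples // train_batch_size)             # 39 if the last partial batch is dropped
    print(len(train_dataloader))                              # what fit() actually uses for steps_per_epoch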

    Question 7

    How do I find the number of evaluation_steps based on the size of my training data (2,509 samples) and batch size (64)? Is 1000 too high?

    The evaluation_steps parameter tells the model whether it should run an additional evaluation with the evaluator part-way through an epoch. Otherwise, the evaluation only runs at the end of each epoch, after steps_per_epoch training steps have completed.

    Based on the numbers you provided, you could, for example, set evaluation_steps to 20 to get an evaluation to run approximately half-way through an epoch (assuming an epoch is ~39 training steps). See this answer and its question for more info on batch size vs. epochs vs. steps per epoch.
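
    As one possible rule of thumb (this is just a heuristic, not something prescribed by the library), you can derive evaluation_steps directly from the dataloader length, e.g. to get one extra evaluation roughly half-way through each epoch:

    steps_per_epoch = len(train_dataloader)          # ~39-40 for 2,509 examples at batch size 64
    evaluation_steps = max(1, steps_per_epoch // 2)  # ~20: one mid-epoch evaluation

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=10,
        evaluation_steps=evaluation_steps,
        output_path='output/my-finetuned-model',     # placeholder path
    )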