The advantage of a set transformer is its ability to handle variable-size inputs. However, I thought a regular transformer could do the same thing. What is the difference between these two models, and why should you use one over the other?
Does the set transformer not require positional encoding? Is it just more modular, making it easier to pick which pieces you want to use?
For reference, here are the set transformer paper and code.
A transformer is fundamentally a model that works on sets of elements. Its traditional use is on sequences of words, which requires adding a positional embedding to each element; that addition is what breaks the permutation symmetry. By default (without a positional embedding), the attention and feed-forward blocks are permutation-equivariant: permuting the input elements simply permutes the outputs in the same way, so nothing in the computation depends on order.
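Here is a minimal sketch of that claim, assuming PyTorch and its built-in `nn.TransformerEncoderLayer` rather than the set transformer code itself: without a positional embedding, permuting the input permutes the output the same way; adding a per-position embedding destroys that property.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_tokens = 16, 5
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
layer.eval()  # disable dropout so the comparison is deterministic

x = torch.randn(1, n_tokens, d_model)   # a "set" of 5 elements
perm = torch.randperm(n_tokens)

with torch.no_grad():
    # No positional embedding: permutation-equivariant.
    out = layer(x)
    out_perm = layer(x[:, perm])
    print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True

    # Add a (random, illustrative) positional embedding: the property is lost,
    # because position i now gets the same offset regardless of which element sits there.
    pos = torch.randn(1, n_tokens, d_model)
    out = layer(x + pos)
    out_perm = layer(x[:, perm] + pos)
    print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # False
```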
A transformer is a model architecture composed of N blocks of attention followed by an element-wise MLP (usually with residual connections). It works on sets because of this permutation symmetry: attention and the element-wise MLP treat every element identically regardless of order. The only thing that differs for sequences is that you need to inject information about each element's position, by adding a positional embedding to the tokens at the input, which removes that symmetry. So yes, you can use a regular transformer on sets; you could say the model is essentially the same, and only the preprocessing step differs.
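A hedged sketch of how this is typically used on sets: run a plain transformer encoder with no positional embedding, then pool over the element dimension with a symmetric operation to get a single order-independent representation. The mean pooling here is just an illustration; the set transformer instead learns this step with attention (its PMA block), but the invariance argument is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
encoder.eval()

def set_embedding(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, set_size, d_model) -> (batch, d_model), independent of element order."""
    # Equivariant encoder + symmetric pooling = permutation-invariant set representation.
    return encoder(x).mean(dim=1)

x = torch.randn(1, 7, d_model)   # a set of 7 elements (set_size can vary per batch)
perm = torch.randperm(7)
with torch.no_grad():
    print(torch.allclose(set_embedding(x), set_embedding(x[:, perm]), atol=1e-5))  # True
```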