Suppose I have a classification task, two algorithms A and B, and a labeled dataset of size M. Both A and B are "deterministic" machine learning approaches, that is to say, neither algorithm takes a random seed as a parameter, so given the same dataset they always produce the same trained classifier.
My question is: if I would like to show that algorithm A is statistically better (or worse) than algorithm B, what should I do?
Well, the way you have described your problem, the only remaining source of randomness is the data itself, so the statistical comparison has to come from varying your datasets. Generate several different datasets (for example, by repeatedly resampling or re-splitting your M labeled examples), run algorithms A and B on each one, and compare the paired results. It's unclear whether your quality metric is classification accuracy or time taken, but the approach works either way.
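A minimal sketch of this procedure, assuming scikit-learn and SciPy are available. The two classifiers here (logistic regression and a decision tree) are stand-ins for your algorithms A and B, and the synthetic dataset from `make_classification` is a placeholder for your real labeled data; the random seeds only control the data splits, not the (deterministic) algorithms.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder for a labeled dataset of size M.
X, y = make_classification(n_samples=500, random_state=0)

n_splits = 30
acc_a, acc_b = [], []
for i in range(n_splits):
    # Vary the dataset by drawing a fresh train/test split each round;
    # the seed i varies the split, while A and B stay deterministic.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=i
    )
    a = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    acc_a.append(a.score(X_te, y_te))
    acc_b.append(b.score(X_te, y_te))

# Paired test on the per-split accuracies: each split scores
# both algorithms on the same data, so the samples are paired.
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"mean acc A={np.mean(acc_a):.3f}, "
      f"mean acc B={np.mean(acc_b):.3f}, p={p_value:.4f}")
```

One caveat: repeated splits of the same dataset overlap, which violates the independence assumption of the plain paired t-test and can make it optimistic; variants such as the 5x2cv paired t-test or the corrected resampled t-test are often recommended for exactly this setting.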