The recent implementation of the Reformer in HuggingFace has both what they call LSH Self Attention and Local Self Attention, but after reading the documentation the difference between them is not clear to me. Both use bucketing to avoid the quadratic memory requirement of vanilla transformers, but it is not clear how they differ.
Is it the case that local self attention only allows queries to attend to keys sequentially near them (i.e., inside a given window in the sentence), as opposed to the proper LSH hashing that LSH self attention does? Or is it something else?
After closely examining the source code, I found that Local Self Attention does indeed attend only to sequentially near tokens: the sequence is split into fixed-length chunks, and each query attends to the keys in its own chunk plus a configurable number of neighboring chunks, whereas LSH Self Attention groups queries and keys into buckets by their locality-sensitive hash before attending within each bucket.
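To make the windowing idea concrete, here is a minimal sketch of chunked local attention. This is not the HuggingFace implementation (which is batched, multi-headed, masked, and handles padding); the function name and the `chunk_len` / `num_chunks_before` parameters are illustrative stand-ins for the kind of options the Reformer config exposes.

```python
# Simplified sketch of chunked "local" self attention, assuming a single head,
# no masking/padding, and seq_len divisible by chunk_len.
import torch

def local_self_attention(q, k, v, chunk_len=4, num_chunks_before=1):
    """q, k, v: (seq_len, dim). Each query only sees keys in its own chunk
    and in `num_chunks_before` preceding chunks."""
    seq_len, dim = q.shape
    n_chunks = seq_len // chunk_len

    # Split the sequence into contiguous chunks: (n_chunks, chunk_len, dim)
    q_ch = q.view(n_chunks, chunk_len, dim)
    k_ch = k.view(n_chunks, chunk_len, dim)
    v_ch = v.view(n_chunks, chunk_len, dim)

    outputs = []
    for i in range(n_chunks):
        # Keys/values visible to chunk i: its own chunk plus a few before it.
        start = max(0, i - num_chunks_before)
        k_local = k_ch[start:i + 1].reshape(-1, dim)
        v_local = v_ch[start:i + 1].reshape(-1, dim)

        # Ordinary scaled dot-product attention, restricted to the local keys,
        # so the cost per chunk depends on the window size, not on seq_len.
        scores = q_ch[i] @ k_local.T / dim ** 0.5
        weights = torch.softmax(scores, dim=-1)
        outputs.append(weights @ v_local)

    return torch.cat(outputs, dim=0)  # (seq_len, dim)

# Example: 16 tokens, model dim 8.
x = torch.randn(16, 8)
out = local_self_attention(x, x, x, chunk_len=4, num_chunks_before=1)
print(out.shape)  # torch.Size([16, 8])
```

LSH Self Attention uses the same chunk-then-attend machinery, but the tokens are first sorted by their hash bucket rather than left in sequence order, which is the key difference between the two layers.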