Tags: pytorch, huggingface-transformers

Is there a significant speed improvement when using a transformers tokenizer on a batch compared to per item?


Is calling the tokenizer on a whole batch significantly faster than calling it on each item in the batch individually? E.g.

encodings = tokenizer(sentences)
# vs
encodings = [tokenizer(x) for x in sentences]
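
For context, the two forms also return different structures, so they are not drop-in replacements for each other. A small sketch of the difference (the question doesn't name a model, so `bert-base-uncased` is assumed here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
sentences = ["hello world", "goodbye world"]

batch = tokenizer(sentences)                    # one BatchEncoding; input_ids is a list of lists
per_item = [tokenizer(x) for x in sentences]    # a Python list of BatchEncoding objects

print(batch["input_ids"])        # [[...], [...]]
print(per_item[0]["input_ids"])  # [...]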

Solution

  • I ended up just timing both, in case it's interesting for someone else:

    %%timeit
    for _ in range(10**4): tokenizer("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
    # 785 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %%timeit
    tokenizer(["Lorem ipsum dolor sit amet, consectetur adipiscing elit."]*10**4)
    # 266 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
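
    If you're not in a notebook, here's a standalone version of the same comparison (a sketch: the answer doesn't say which tokenizer was timed, so `bert-base-uncased` is assumed, and absolute numbers will vary by machine and model):

    import timeit
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
    sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
    n = 10**4

    # per-item: one tokenizer call per sentence
    per_item = timeit.timeit(lambda: [tokenizer(sentence) for _ in range(n)], number=1)
    # batched: a single tokenizer call over the whole list
    batched = timeit.timeit(lambda: tokenizer([sentence] * n), number=1)

    print(f"per item: {per_item:.3f} s")
    print(f"batched:  {batched:.3f} s")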