Tags: huggingface-transformers, bert-language-model, nlp-question-answering

How to use the TAPAS table question answering model when the table is large, e.g. 50,000 rows?


I am trying to build a model in which I load a DataFrame (an Excel file from Kaggle) and query it with the TAPAS-large-finetuned-wtq model. Querying 259 rows (memory usage 62.9 KB) works without a problem, but querying 260 rows (memory usage 63.1 KB) fails with the error "index out of range in self". I have attached a screenshot for reference. The data I used is available from Kaggle datasets.

[Screenshot of the error]

The code I am using is:

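from transformers import pipeline
import pandas as pd
import torch

# NOTE: df is the DataFrame loaded from the Kaggle Excel file (loading code not shown).

question = "Which Country code has the quantity 30604?"
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")

# Query a slice of the table and keep only the answer cells; this is the line that raises the error.
c = tqa(table=df[:100], query=question)['cells']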

The error is raised on the last line, as you can see in the screenshot.

How can I work around this? Any tips would be welcome.


Solution

  • TAPAS works by flattening the table into a sequence of word pieces. This sequence must fit into the specified maximum sequence length (512 by default). TAPAS has a pruning mechanism that tries to drop tokens, but it never drops cells. Therefore, at a sequence length of 512, there is no way to fit a table with more than 512 cells.

    If you really want to run the model on 1.8M rows, I would suggest splitting your data row-wise. For your table, for example, you would need blocks of at most ~8 rows.

    Alternatively, you can increase the maximum sequence length, but that also increases the cost of running the model.

    https://github.com/google-research/tapas/issues/14
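
The row-wise split suggested above can be sketched roughly as follows. This is a minimal illustration, not the TAPAS authors' method: `block_ranges` and `rows_per_block_upper_bound` are hypothetical helper names, and the bound assumes every cell (header row included) costs at least one token and ignores the tokens used by the question, so real blocks usually need to be smaller.

```python
from typing import Iterator, Tuple


def rows_per_block_upper_bound(n_columns: int, max_seq_len: int = 512) -> int:
    """Optimistic upper bound on data rows per block.

    Assumes each cell needs at least one token and reserves one row for the
    header. Cells that tokenize into several word pieces, plus the question
    itself, reduce the usable number further.
    """
    return max(max_seq_len // n_columns - 1, 0)


def block_ranges(n_rows: int, rows_per_block: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive (start, stop) row ranges that together cover n_rows."""
    if rows_per_block < 1:
        raise ValueError("rows_per_block must be at least 1")
    for start in range(0, n_rows, rows_per_block):
        yield start, min(start + rows_per_block, n_rows)


# Usage with the pipeline from the question (not executed here; assumes `df`
# is the loaded table, and `tqa` and `question` are defined as in the question):
#
# df = df.astype(str)  # the TAPAS tokenizer expects every cell to be a string
# answers = []
# for start, stop in block_ranges(len(df), rows_per_block=8):
#     block = df.iloc[start:stop].reset_index(drop=True)
#     answers.append(tqa(table=block, query=question))
```

Each block is answered independently, so the per-block results still have to be combined. For a lookup question like the one above, blocks that do not contain the matching row may still return a cell, so the answers are worth filtering or checking against the table.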