cloud-document-ai

Can we pass table column info to help FormParser determine header_row contents?


Suppose I have a PDF file containing the following information:

Trainer: Giannis

Pokedex: Incomplete

Name        Type          Weight    Height  Color
Pikachu     Electric      6.0 kg    0.4 m   Yellow
Bulbasaur   Grass/Poison  6.9 kg    0.7 m   Green
Charizard   Fire/Flying   90.5 kg   1.7 m   Orange
Jigglypuff  Normal/Fairy  5.5 kg    0.5 m   Pink
Gyarados    Water/Flying  235.0 kg  6.5 m   Blue

I am using the Form Parser to extract the table information.

If I know that the table columns will always be [Name, Type, ... , Color], is there a way to pass this information to the Form Parser processor to help it determine the header rows more reliably?

Thank you in advance for your time!


Solution

  • The Form Parser does not currently accept any "hints" that adjust the model. You can try a different version of the Form Parser model to see whether its results are closer to what you expect.

    To extract values from a document using a custom-defined schema like the one you describe, you will likely get the best results from a Custom Document Extractor. You can follow this guide for instructions on how to build a custom processor, and this section about Quick Tables in the labeling documentation could be useful to speed up labeling for tabular data.
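Since the processor itself takes no column hints, one possible workaround is to apply the known column names after extraction. The sketch below is not a Document AI feature and makes no API calls. It assumes you have already flattened each detected table (its header_rows followed by its body_rows) into a list of rows, each a list of cell strings. The names EXPECTED_COLUMNS, find_header_index, and rows_to_records are hypothetical helpers introduced here for illustration.

```python
from typing import Dict, List, Optional, Sequence

# Known column names, taken from the example table in the question.
EXPECTED_COLUMNS = ["Name", "Type", "Weight", "Height", "Color"]


def _clean(cell: str) -> str:
    """Collapse runs of whitespace (including trailing newlines) in a cell."""
    return " ".join(cell.split())


def find_header_index(
    rows: Sequence[Sequence[str]], expected: Sequence[str]
) -> Optional[int]:
    """Return the index of the first row matching the expected columns.

    Matching is case-insensitive and ignores extra whitespace.
    Returns None if no row matches.
    """
    wanted = [_clean(name).lower() for name in expected]
    for index, row in enumerate(rows):
        if [_clean(cell).lower() for cell in row] == wanted:
            return index
    return None


def rows_to_records(
    rows: Sequence[Sequence[str]], expected: Sequence[str]
) -> List[Dict[str, str]]:
    """Map each data row to a dict keyed by the expected column names.

    Rows up to and including the header are dropped. If no header row is
    found, every row is treated as data. Rows whose cell count differs
    from the number of expected columns are skipped.
    """
    header_index = find_header_index(rows, expected)
    body = rows if header_index is None else rows[header_index + 1:]
    return [
        dict(zip(expected, (_clean(cell) for cell in row)))
        for row in body
        if len(row) == len(expected)
    ]
```

Calling rows_to_records(rows, EXPECTED_COLUMNS) gives one dict per data row regardless of whether the processor classified the header as a header row or as a body row, because the header is located by its content and not by its position in the output.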