
How to format TSV files to use with torchtext?


I'm formatting the file like this:

Jersei  N
atinge  V
média   N
. PU

Programe    V
...

The first string on each line is the lexical item and the second is its POS tag. But the empty line (which I'm using to mark the end of a sentence) gives me the error AttributeError: 'Example' object has no attribute 'text' when I run the following code:

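# Legacy torchtext API; in torchtext 0.9 to 0.11, import from torchtext.legacy instead
from torchtext import data, datasets

src = data.Field()
trg = data.Field(sequential=False)
mt_train = datasets.TabularDataset(
    path='/path/to/file.tsv',
    format='tsv',  # required argument of TabularDataset
    fields=[('text', src), ('labels', trg)])  # fields must be (name, Field) pairs
# Raises the AttributeError: rows read from blank lines produce Examples with no 'text'
src.build_vocab(mt_train)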

What is the proper way to indicate the end of a sentence (EOS) to torchtext?


Solution

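  • The following code reads the TSV the way I formatted it:

    from torchtext import data, datasets
    # Assumed definitions: one Field for the tokens and one for the tags
    text, labels = data.Field(), data.Field()
    mt_train = datasets.SequenceTaggingDataset(path='/path/to/file.tsv',
                                               fields=(('text', text),
                                                       ('labels', labels)))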
    

    SequenceTaggingDataset treats an empty line as the sentence separator, so each block of lines between blank lines becomes a single example whose text and labels are parallel sequences.
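
To make the blank-line behaviour concrete, here is a minimal pure-Python sketch of that kind of grouping. It is an illustration under assumptions, not torchtext's actual code: the function name read_tagged_sentences is hypothetical, the columns are assumed to be tab-separated, and the sample rows are the ones from the question.

```python
def read_tagged_sentences(lines, separator="\t"):
    """Group token/tag rows into sentences, splitting on blank lines."""
    sentences = []
    columns = []  # columns[0] collects tokens, columns[1] collects tags
    for line in lines:
        line = line.strip()
        if line == "":
            # A blank line closes the current sentence, if there is one.
            if columns:
                sentences.append(columns)
            columns = []
        else:
            for i, value in enumerate(line.split(separator)):
                if len(columns) < i + 1:
                    columns.append([])
                columns[i].append(value)
    # The file may not end with a blank line, so flush the last sentence.
    if columns:
        sentences.append(columns)
    return sentences


# Rows from the question, assumed to be tab-separated in the real file.
sample = [
    "Jersei\tN",
    "atinge\tV",
    "média\tN",
    ".\tPU",
    "",
    "Programe\tV",
]
sentences = read_tagged_sentences(sample)
for tokens, tags in sentences:
    print(tokens, tags)
```

Each sentence comes out as a pair of parallel lists (tokens and tags). A row-per-example reader, by contrast, would turn the blank line into an example with no fields at all, which is consistent with the AttributeError in the question.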