Tags: python, tensorflow, bert-language-model

BERT skipping the 1st row of test.tsv when predicting


I'm running the BERT-Base, Uncased pre-trained model on a news classification problem. Most of the core data-preparation logic was copied from here, with the relevant changes made for my own dataset. I have 490 news articles, split into train, validation and test sets of 405, 45 and 40 articles respectively. These sets are stored as train.tsv, dev.tsv and test.tsv in the same directory, all without a header row. The command I use to run the classifier looks something like this:

python /Users/<username>/Documents/CodeBase/Projects/BERT/run_classifier.py \
--task_name=cola \
--do_train=true \
--do_eval=true \
--do_predict=true \
--data_dir=/Users/<username>/Desktop/NLP_Learning/Fraud\ detection/BERT \
--vocab_file=./vocab.txt \
--bert_config_file=./bert_config.json \
--init_checkpoint=./bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/Users/<username>/Desktop/NLP_Learning/Fraud\ detection/BERT_Model_Pretrained/output \
--do_lower_case=True

Training and prediction both finish, but the generated test_results.tsv file contains only 39 rows instead of the expected 40. It looks as if row 0 of test.tsv is being skipped. What am I missing here? I've checked all three input files, and each contains the correct number of records.


Solution

  • Yes, the data format for the cola task is very specific. It requires three files, train.tsv, dev.tsv and test.tsv, for the training, development/validation and test sets respectively.

    As for the format of each TSV file, train.tsv and dev.tsv share the same tab-separated layout:

    id class_label segment text

    and neither train.tsv nor dev.tsv should have a header row.

    test.tsv, however, uses a different format:

    id text

    (Note that you should not provide the label or segment columns.)

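    More importantly, test.tsv should have a header row. The cola processor in run_classifier.py always skips the first row of test.tsv on the assumption that it is a header. Since your test.tsv has no header, your first article is dropped, which is why you get 39 predictions instead of 40.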
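
A minimal sketch of one way to produce a test.tsv in the layout described above, assuming the articles are already available as a list of strings. The function names write_test_tsv and read_test_rows and the header labels "id" and "text" are illustrative choices, not part of BERT. The reader only mimics the skip-the-first-row behaviour so you can check the row count before running prediction; it is not the actual run_classifier.py code.

```python
def clean(text):
    # Collapse tabs and newlines so each article stays on one TSV row.
    return " ".join(text.split())


def write_test_tsv(path, texts):
    """Write a header row followed by one 'id<TAB>text' row per article."""
    with open(path, "w", encoding="utf-8") as f:
        # Header labels are arbitrary (assumption): the row is only there to be skipped.
        f.write("id\ttext\n")
        for i, text in enumerate(texts):
            f.write(f"{i}\t{clean(text)}\n")


def read_test_rows(path):
    """Simplified stand-in for a test reader: drop row 0, keep column 2 as the text."""
    with open(path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    return [row[1] for row in rows[1:]]
```

With 40 articles, write_test_tsv produces a 41-line file and read_test_rows returns 40 texts. If the header line were omitted, the same reader would return only 39, which matches the symptom in the question.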