python, nlp, allennlp

How to train a textual entailment model with my own training set?


I'd like to train the "decomposable attention + ELMo; SNLI" model from the demo on my own dataset. I'm new to NLP. After going through the guide, I still have no idea how to get started with my own training set, which consists of plain-text premise, hypothesis, and label. The data format is shown below.

Based on the training command in the demo, I found that its training set is https://allennlp.s3.amazonaws.com/datasets/snli/snli_1.0_train.jsonl. How can I generate a training set in that format from my own data?

FYI, my dataset looks like this:

{ "premise":"sentences", "hypothesis":"sentences", "label":"x"}
{ "premise":"sentences", "hypothesis":"sentences", "label":"y"}
...

An entry in snli_1.0_train.jsonl looks like this:

{"annotator_labels": ["neutral"], "captionID": "3416050480.jpg#4", "gold_label": "neutral", "pairID": "3416050480.jpg#4r1n", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is training his horse for a competition.", "sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"}

I would really appreciate any help. Thanks.


Solution

  • When applying AllenNLP to a new dataset, you usually need to implement a new DatasetReader. In this case you could simply adapt the existing SnliReader to the format of your dataset, or adjust the format of your dataset to work with the existing SnliReader. The reader's source shows that it only looks for 3 fields in each JSON line: "gold_label" (your "label"), "sentence1" (your "premise"), and "sentence2" (your "hypothesis"). The other keys in the SNLI entry (parses, IDs, annotator labels) are not needed.
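
The second option (reshaping your data so the existing SnliReader can read it) can be sketched as a small conversion script. This is only a minimal sketch, assuming your file has one JSON object per line with exactly the keys "premise", "hypothesis", and "label" as in the question. The function names `to_snli_record` and `convert_file` and the file names are hypothetical, chosen for illustration. Label values are copied through unchanged.

```python
import json


def to_snli_record(example):
    """Map a {"premise", "hypothesis", "label"} dict to the three SNLI keys."""
    return {
        "gold_label": example["label"],
        "sentence1": example["premise"],
        "sentence2": example["hypothesis"],
    }


def convert_file(src_path, dst_path):
    """Rewrite a JSON-lines file record by record; return the number written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = to_snli_record(json.loads(line))
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count


# Example call (hypothetical file names):
# convert_file("my_train.jsonl", "my_train_snli.jsonl")
```

You would then point the training configuration's training-data path at the converted file instead of the SNLI URL. Convert your validation set the same way if you have one.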