I have two files, file1.csv and file2.csv. file1.csv contains a stupid sentence in each row, and file2.csv identifies which column it is in (type0 corresponding to 0, type1 corresponding to 1). I want to do an NLP classification task, and I usually know how to do that. But in this situation I am a bit confused and do not know how to arrange and organize my dataset so that I can train on my sentences and labels. I would appreciate it if someone could give me a hint on how to proceed.
file1.csv is in the following format:
id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
file2.csv is in the following format:
id,stupid
0,0
1,1
2,0
My purpose is to classify the stupid sentences.
Assuming that every row always contains one sentence that is semantically correct and one that isn't, you can just split the type0 and type1 sentences into two separate examples and classify them individually, e.g.:
id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
Becomes:
id,sentence
0,He married to a dinosaur.
1,He married to a women.
2,She drinks a beer.
3,She drinks a banana.
4,He lifted a 500 tons.
5,He lifted a 50kg.
However, this won't work if your data contains rows where one sentence is only slightly less stupid than the other, i.e. where the two sentences actually need to be compared against each other.
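For the simple case, the split could be sketched roughly as below, using only Python's standard csv module. This is a minimal sketch, not a definitive implementation: the function name flatten and the label convention (1 = stupid, 0 = not stupid) are my own assumptions, and the sample rows from the question are inlined so the snippet runs without the actual files. It also attaches a label to each sentence by looking up the stupid column from file2.csv.

```python
import csv
import io

# Inline copies of the question's sample data, so the sketch runs without files.
# In practice, pass open("file1.csv", newline="") and open("file2.csv", newline="").
FILE1 = """id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
"""

FILE2 = """id,stupid
0,0
1,1
2,0
"""


def flatten(file1, file2):
    """Return (sentence, label) pairs.

    Assumed label convention: 1 = stupid sentence, 0 = the other one.
    """
    # Map each row id to the column index (0 or 1) that holds the stupid sentence.
    stupid_col = {row["id"]: int(row["stupid"]) for row in csv.DictReader(file2)}

    examples = []
    for row in csv.DictReader(file1):
        for col in (0, 1):
            sentence = row[f"type{col}"]
            label = 1 if stupid_col[row["id"]] == col else 0
            examples.append((sentence, label))
    return examples


examples = flatten(io.StringIO(FILE1), io.StringIO(FILE2))
for sentence, label in examples:
    print(label, sentence)
```

The resulting list of (sentence, label) pairs can then be fed to whatever single-sentence classifier you normally use. Note that sentences containing commas would need to be quoted in the CSV for this to parse correctly.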