
Identifying Grammatically Correct Nonsense Sentences


I have two files, file1.csv and file2.csv. Each row of file1.csv contains a stupid sentence in one of two columns. file2.csv identifies which column it is in (0 for type0, 1 for type1). I want to do an NLP classification task, and I usually know how to do that. In this situation, though, I am a bit confused and do not know how to arrange and organize my dataset so that I can train on my sentences and labels. I would appreciate a hint on how to proceed.

file1.csv is in the following format:

id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.

file2.csv is in the following format:

id,stupid
0,0
1,1
2,0

My goal is to classify the stupid sentences.


Solution

  • Assuming that every row always contains exactly one sentence that is semantically correct and one that isn't, you can split the type0 and type1 sentences into two separate examples and classify them individually, e.g.:

    id,type0,type1
    0,He married to a dinosaur.,He married to a women.
    1,She drinks a beer.,She drinks a banana.
    2,He lifted a 500 tons.,He lifted a 50kg.
    

    Becomes:

    id,sentence
    0,He married to a dinosaur.
    1,He married to a women.
    2,She drinks a beer.
    3,She drinks a banana.
    4,He lifted a 500 tons.
    5,He lifted a 50kg.
    

    However, this won't work if your data contains records where one sentence is only slightly less stupid than the other, i.e. where the two sentences actually need to be compared with each other.
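
The split described above can be sketched with the standard-library `csv` module. This is a minimal sketch, not a definitive implementation: it assumes the `stupid` value in file2.csv is the index of the stupid column (0 for type0, 1 for type1), as the question states. The function name `build_dataset`, the tuple layout, and the convention "label 1 = stupid" are illustrative choices that do not come from the original post. The file contents are inlined from the question so the example runs on its own.

```python
import csv
import io

# Inline stand-ins for file1.csv and file2.csv, copied from the question.
FILE1 = """id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
"""
FILE2 = """id,stupid
0,0
1,1
2,0
"""


def build_dataset(pairs_file, labels_file):
    """Return one (pair_id, sentence, label) tuple per sentence; label 1 = stupid.

    Hypothetical helper: the name and the output layout are assumptions.
    """
    # Map each pair id to the index (0 or 1) of the column holding the stupid sentence.
    stupid_by_id = {
        row["id"]: int(row["stupid"]) for row in csv.DictReader(labels_file)
    }
    rows = []
    for pair in csv.DictReader(pairs_file):
        stupid_col = stupid_by_id[pair["id"]]
        for col in (0, 1):
            sentence = pair[f"type{col}"]
            # The sentence is labelled stupid when its column index matches file2's value.
            rows.append((int(pair["id"]), sentence, int(col == stupid_col)))
    return rows


dataset = build_dataset(io.StringIO(FILE1), io.StringIO(FILE2))
for row in dataset:
    print(row)
```

With the real files, the two `io.StringIO` objects would be replaced by handles such as `open("file1.csv", newline="")`. Keeping the original pair id next to each sentence means the two sentences of a pair can still be matched up later if they have to be compared with each other.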