python nlp pytorch text-classification torchtext

Load a plain text file into PyTorch

I have two separate files, one is a text file, with each line being a single text. The other file contains the class label of that corresponding line. How do I load this into PyTorch and carry out further tokenization, embedding, etc?

Solution

What have you tried already? What you described is not really PyTorch-specific yet: you can write a pre-processing script that loads all the sentences into a single data structure, e.g. a list of (text, label) tuples. You can also split your data into training and hold-out sets in this step. You can then dump all of this into .csv files.
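A minimal sketch of such a pre-processing script, using only the standard library. The function names, the 80/20 split, and the `text,label` header row are my own assumptions for illustration, not anything PyTorch requires:

```python
import csv
import random


def load_pairs(text_path, label_path):
    """Read two parallel files and return a list of (text, label) tuples."""
    with open(text_path, encoding="utf-8") as f_text, \
            open(label_path, encoding="utf-8") as f_label:
        texts = [line.rstrip("\n") for line in f_text]
        labels = [line.strip() for line in f_label]
    if len(texts) != len(labels):
        raise ValueError("text and label files have different line counts")
    return list(zip(texts, labels))


def split_pairs(pairs, held_out_fraction=0.2, seed=0):
    """Shuffle a copy of the pairs and return (train, held_out)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


def write_csv(pairs, path):
    """Write (text, label) rows to a .csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(pairs)
```

You would call `load_pairs` on your two files, pass the result to `split_pairs`, and call `write_csv` once per split. The `csv` module takes care of quoting, so commas inside a sentence do not break the rows.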

Then, one way to do it is in 3 steps:

Implement a Dataset class to load your data efficiently, reading the .csv files produced above;
Have another class, like a Vocabulary, that keeps a mapping from tokens to ids and vice versa;
Have something like a Vectorizer that converts your sentences into vectors, either one-hot encodings or embeddings.
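The three pieces could be sketched roughly as follows. This assumes a .csv file with a `text,label` header row, plain whitespace tokenization, and a collapsed one-hot (bag-of-words) vector per sentence. Only `torch.utils.data.Dataset` is actual PyTorch API here; the class names, method names, and the `<unk>` token are illustrative choices of mine:

```python
import csv

import torch
from torch.utils.data import Dataset


class Vocabulary:
    """Two-way mapping between tokens and integer ids."""

    def __init__(self, unk_token="<unk>"):
        self.token_to_id = {}
        self.id_to_token = []
        self.unk_id = self.add(unk_token)

    def add(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]

    def lookup(self, token):
        # Tokens never seen while building the vocabulary map to <unk>.
        return self.token_to_id.get(token, self.unk_id)

    def __len__(self):
        return len(self.id_to_token)


class Vectorizer:
    """Turns a sentence into a vocabulary-sized vector of 0s and 1s."""

    def __init__(self, vocab):
        self.vocab = vocab

    @staticmethod
    def tokenize(text):
        return text.lower().split()  # assumption: whitespace tokenization

    @classmethod
    def from_texts(cls, texts):
        vocab = Vocabulary()
        for text in texts:
            for token in cls.tokenize(text):
                vocab.add(token)
        return cls(vocab)

    def vectorize(self, text):
        vector = torch.zeros(len(self.vocab))
        for token in self.tokenize(text):
            vector[self.vocab.lookup(token)] = 1.0
        return vector


class TextClassificationDataset(Dataset):
    """Serves (vector, label id) pairs from rows of (text, label)."""

    def __init__(self, rows, vectorizer, label_to_id):
        self.rows = rows
        self.vectorizer = vectorizer
        self.label_to_id = label_to_id

    @classmethod
    def from_csv(cls, path, vectorizer=None, label_to_id=None):
        with open(path, newline="", encoding="utf-8") as f:
            rows = [(r["text"], r["label"]) for r in csv.DictReader(f)]
        if vectorizer is None:
            vectorizer = Vectorizer.from_texts(text for text, _ in rows)
        if label_to_id is None:
            labels = sorted({label for _, label in rows})
            label_to_id = {label: i for i, label in enumerate(labels)}
        return cls(rows, vectorizer, label_to_id)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        text, label = self.rows[index]
        x = self.vectorizer.vectorize(text)
        y = torch.tensor(self.label_to_id[label])
        return x, y
```

For the hold-out file you would pass in the `vectorizer` and `label_to_id` built from the training file, so that both splits share the same ids. If you want embeddings instead, the Vectorizer would return the list of token ids and the network would start with an `nn.Embedding` layer.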

Then you can use these to produce a vector representation of your sentences and pass it to a neural network.
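As a rough illustration of that last step, here is a small feed-forward classifier trained on batches from a `DataLoader`. The random tensors are stand-ins for your vectorized sentences and label ids, and the layer sizes, learning rate, and epoch count are arbitrary placeholders rather than recommendations:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 8 "sentences" already vectorized to size 10, 2 classes.
# In practice this would be your own Dataset of (vector, label id) pairs.
features = torch.rand(8, 10)
labels = torch.randint(0, 2, (8,))
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)


class Classifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):
        return self.layers(x)  # raw logits, one per class


model = Classifier(input_size=10, hidden_size=16, num_classes=2)
loss_fn = nn.CrossEntropyLoss()  # expects logits and integer class ids
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(3):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```

The `DataLoader` handles batching and shuffling for you, which is the main reason to wrap your data in a Dataset in the first place.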

Look into this notebook to understand all this in more detail:

Sentiment Classification