python nlp pytorch text-classification torchtext

Load a plain text file into PyTorch


I have two separate files. One is a text file in which each line is a single text; the other contains the class label for the corresponding line. How do I load these into PyTorch and carry out further tokenization, embedding, etc.?


Solution

  • What have you tried already? What you described is not yet very PyTorch-specific: you can write a pre-processing script that loads all the sentences into a single data structure, e.g. a list of (text, label) tuples. You can also split your data into training and hold-out sets in this step, and then dump everything into .csv files.
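
A minimal sketch of such a pre-processing script, using only the Python standard library. The function names, the "text"/"label" column headers, and the 80/20 split are assumptions made for illustration, not something the question specifies:

```python
import csv
import random


def read_pairs(text_path, label_path):
    """Pair each line of the text file with the label on the same line number."""
    with open(text_path, encoding="utf-8") as tf, open(label_path, encoding="utf-8") as lf:
        texts = [line.rstrip("\n") for line in tf]
        labels = [line.strip() for line in lf]
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return list(zip(texts, labels))


def split_pairs(pairs, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the pairs and return (train, holdout)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def write_csv(pairs, path):
    """Write (text, label) tuples to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])  # assumed column names
        writer.writerows(pairs)
```

With hypothetical file names, usage could look like `train, holdout = split_pairs(read_pairs("texts.txt", "labels.txt"))`, followed by `write_csv(train, "train.csv")` and `write_csv(holdout, "holdout.csv")`. The csv module quotes texts that contain commas, so the sentences do not need to be cleaned first.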

    Then, one way to do it is in 3 steps:

    • Implement a Dataset class to load your data efficiently, reading the produced .csv files;
    • Have another class, such as a Vocabulary, that keeps a mapping from tokens to ids and vice versa;
    • Something like a Vectorizer that converts your sentences into vectors, either one-hot encodings or embeddings.
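
The Vocabulary and Vectorizer steps can be sketched in plain Python as follows. The class and method names, the "<unk>" token for unknown words, and the lowercase whitespace tokenizer are all illustrative assumptions; a real project would likely use a proper tokenizer:

```python
class Vocabulary:
    """Two-way mapping between tokens and integer ids."""

    def __init__(self, unk_token="<unk>"):
        self.token_to_id = {}
        self.id_to_token = []
        # Reserve an id for tokens that were never added (assumed convention).
        self.unk_id = self.add(unk_token)

    def add(self, token):
        """Register a token if it is new and return its id."""
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]

    def lookup_id(self, token):
        return self.token_to_id.get(token, self.unk_id)

    def lookup_token(self, index):
        return self.id_to_token[index]

    def __len__(self):
        return len(self.id_to_token)


class Vectorizer:
    """Turns raw sentences into id sequences or collapsed one-hot vectors."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    @staticmethod
    def tokenize(text):
        # Deliberately naive: lowercase and split on whitespace.
        return text.lower().split()

    @classmethod
    def from_texts(cls, texts):
        """Build the vocabulary from an iterable of training sentences."""
        vocabulary = Vocabulary()
        for text in texts:
            for token in cls.tokenize(text):
                vocabulary.add(token)
        return cls(vocabulary)

    def to_ids(self, text):
        """Id sequence, suitable as input to an embedding layer."""
        return [self.vocabulary.lookup_id(t) for t in self.tokenize(text)]

    def to_one_hot(self, text):
        """Collapsed one-hot vector: 1.0 at every id present in the sentence."""
        vector = [0.0] * len(self.vocabulary)
        for index in self.to_ids(text):
            vector[index] = 1.0
        return vector
```

The vocabulary should be built from the training split only, so that hold-out sentences exercise the unknown-token path the same way unseen data would.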

    Then you can use this to produce a vector representation of your sentences and pass it to a neural network.
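
To make the last step concrete, here is one possible sketch of a Dataset that reads such a .csv file and feeds a small classifier. It assumes "text" and "label" columns, a plain dict from tokens to ids with id 0 reserved for unknown tokens, and a bag-of-embeddings model built on nn.EmbeddingBag; all class and function names are hypothetical, and the architecture is just one simple option:

```python
import csv

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class CsvTextDataset(Dataset):
    """Reads a CSV with 'text' and 'label' columns (an assumed layout)."""

    def __init__(self, csv_path, token_to_id, label_to_id):
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.token_to_id = token_to_id
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]
        # Whitespace tokenization; unknown tokens map to id 0 (assumed <unk>).
        ids = [self.token_to_id.get(tok, 0) for tok in row["text"].lower().split()]
        return torch.tensor(ids, dtype=torch.long), self.label_to_id[row["label"]]


def collate(batch):
    """Flatten variable-length id sequences into the (input, offsets) form
    that nn.EmbeddingBag accepts for 1-D input."""
    sequences, labels = zip(*batch)
    lengths = [len(s) for s in sequences[:-1]]
    offsets = torch.tensor([0] + lengths).cumsum(dim=0)
    return torch.cat(sequences), offsets, torch.tensor(labels)


class BagOfEmbeddingsClassifier(nn.Module):
    """Averages token embeddings per sentence, then applies a linear layer."""

    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # default mode is "mean"
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.linear(self.embedding(token_ids, offsets))
```

A DataLoader created with `DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)` then yields `(token_ids, offsets, labels)` batches, and `model(token_ids, offsets)` returns one row of class logits per sentence, which can be trained with nn.CrossEntropyLoss.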

    Look into this notebook to understand all this in more detail: