python machine-learning neural-network feature-extraction pybrain

Pybrain Text Classification: data and input


I have 3 sets of sentences (varying in word counts), but I don't know how to extract features from the text such that the input dimension will remain the same.

For example, I've tried bag-of-words but, since the word-count variation causes input-dimension variation, I eventually get errors.

I would much appreciate it if you could show me an approach to preparing the string data for the neural network.

Thank you!

(Python 2.7 in Windows 7)


Solution

  • How to format the input

    This is an excerpt from wikipedia.org:


    Here are two simple text documents:

    John likes to watch movies. Mary likes too.


    John also likes to watch football games.


    Based on these two text documents, a dictionary is constructed as:

    {
        "John": 1,
        "likes": 2,
        "to": 3,
        "watch": 4,
        "movies": 5,
        "also": 6,
        "football": 7,
        "games": 8,
        "Mary": 9,
        "too": 10
    }
    

    The dictionary has 10 distinct words. Using the dictionary's indexes, each document is represented by a 10-entry vector in which each entry counts how many times the corresponding word occurs in that document:

    [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
    [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
    


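    Because the length of the vector equals the number of words in the dictionary, not the number of words in the document, your input will remain the same size regardless of the length of your document. I hope this helps.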
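
As a rough illustration of the idea, here is a minimal pure-Python sketch that turns sentences into fixed-length count vectors. The helper names `tokenize`, `build_vocabulary` and `vectorize` are made up for this example and are not part of PyBrain. The sketch assumes a simple tokenizer that keeps only letters and apostrophes, uses 0-based positions instead of the 1-based indexes shown above, and ignores words that are not in the dictionary.

```python
import re


def tokenize(text):
    # Assumption: words are runs of letters/apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z']+", text)


def build_vocabulary(documents):
    # Give each distinct word a 0-based position, in order of first appearance.
    vocabulary = {}
    for document in documents:
        for word in tokenize(document):
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def vectorize(document, vocabulary):
    # One slot per dictionary word; the length never depends on the document.
    vector = [0] * len(vocabulary)
    for word in tokenize(document):
        index = vocabulary.get(word)
        if index is not None:  # words outside the dictionary are skipped
            vector[index] += 1
    return vector


# Word order copied from the dictionary in the answer (0-based here).
words = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]
vocabulary = dict((word, i) for i, word in enumerate(words))

doc1 = vectorize("John likes to watch movies. Mary likes too.", vocabulary)
doc2 = vectorize("John also likes to watch football games.", vocabulary)
# A sentence never seen before, containing two words not in the dictionary.
unseen = vectorize("Mary also likes to watch football and tennis.", vocabulary)

print(doc1)         # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(doc2)         # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(len(unseen))  # 10
```

In practice you would probably build the dictionary once from all training sentences (for example with `build_vocabulary`) and reuse it unchanged for every sentence you feed to the network, so that the input layer size can simply be set to `len(vocabulary)`.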