
processing strings of text for neural network input


I understand that ANN input must be normalized, standardized, etc. Leaving the peculiarities of the various ANN models aside, how can I preprocess UTF-8 encoded text into the range [0, 1], or alternatively into the range [-1, 1], before it is given as input to a neural network? I have been searching for this on Google but can't find any information (I may be using the wrong terms).

  1. Does that make sense?
  2. Isn't that how text is preprocessed for neural networks?
  3. Are there any alternatives?
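
A minimal sketch of the kind of direct encoding I had in mind, naively scaling raw UTF-8 bytes into the target range (the function names are mine, purely illustrative):

```python
def bytes_to_unit_range(text: str) -> list[float]:
    """Scale each UTF-8 byte of `text` from [0, 255] into [0.0, 1.0]."""
    return [b / 255.0 for b in text.encode("utf-8")]

def bytes_to_signed_range(text: str) -> list[float]:
    """Scale each UTF-8 byte into [-1.0, 1.0] instead."""
    return [b / 127.5 - 1.0 for b in text.encode("utf-8")]

print(bytes_to_unit_range("Hi"))  # bytes 72 and 105, each divided by 255
```

Note that such an encoding preserves no semantics at all: nearness of byte values says nothing about nearness of meaning.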

Update on November 2013

I have long accepted Pete's answer as correct. However, I now have serious doubts, mostly due to recent research I've been doing on symbolic knowledge and ANNs.

Dario Floreano and Claudio Mattiussi in their book explain that such processing is indeed possible, by using distributed encoding.

Indeed, if you try a Google Scholar search, there exists a plethora of neuroscience articles and papers on how distributed encoding is hypothesized to be used by brains to encode symbolic knowledge.
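
As a toy illustration of the difference between a local (one-hot) encoding and a distributed encoding of symbols (the dimensions and random vectors here are my own invention, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
symbols = ["cat", "dog", "car"]

# Local (one-hot) encoding: exactly one unit fires per symbol.
one_hot = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

# Distributed encoding: each symbol's identity is spread across all units.
dim = 8
distributed = {s: rng.uniform(-1.0, 1.0, size=dim) for s in symbols}

print(one_hot["cat"])      # [1. 0. 0.]
print(distributed["cat"])  # 8 values in [-1, 1]
```

Random vectors carry no "logical distance" between symbols, which is exactly Kohonen's objection quoted next; a distributed encoding only becomes useful once the vectors are learned.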

Teuvo Kohonen, in his paper "Self-Organizing Maps", explains:

One might think that applying the neural adaptation laws to a symbol set (regarded as a set of vectorial variables) might create a topographic map that displays the "logical distances" between the symbols. However, there occurs a problem which lies in the different nature of symbols as compared with continuous data. For the latter, similarity always shows up in a natural way, as the metric differences between their continuous encodings. This is no longer true for discrete, symbolic items, such as words, for which no metric has been defined. It is in the very nature of a symbol that its meaning is dissociated from its encoding.

However, Kohonen did manage to deal with Symbolic Information in SOMs!

Furthermore, Prof. Dr. Alfred Ultsch, in his paper "The Integration of Neural Networks with Symbolic Knowledge Processing", deals exactly with how to process symbolic knowledge (such as text) in ANNs. Ultsch offers the following methodologies for processing symbolic knowledge: Neural Approximative Reasoning, Neural Unification, Introspection and Integrated Knowledge Acquisition, although little information on them can be found on Google Scholar, or anywhere else for that matter.

Pete, in his answer, is right about semantics: semantics in ANNs are usually disconnected from the encoding. However, the references below provide insight into how researchers have trained RBMs to recognize similarity in the semantics of different word inputs, so capturing semantics should not be impossible; it would, though, require a layered approach, or a secondary ANN, if semantics are required.

  • Natural Language Processing with Subsymbolic Neural Networks, Risto Miikkulainen, 1997
  • Training Restricted Boltzmann Machines on Word Observations, G. E. Dahl, R. P. Adams, H. Larochelle, 2012

Update on January 2021

The field of NLP and Deep Learning has seen a resurgence of research in the years since I asked this question. There are now machine-learning models which address what I was trying to achieve in many different ways.

For anyone arriving at this question wondering how to pre-process text for Deep Learning or neural networks, here are a few helpful topics, none of which are academic, but which are simple to understand and should get you started on solving similar tasks:

At the time I asked that question, RNNs, CNNs and VSMs were only just starting to be used; nowadays most Deep Learning frameworks support extensive word embeddings. Hope the above helps.
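
For newcomers, the word-embedding idea reduces to a lookup table: map each token to an index, and each index to a dense row of an embedding matrix. A framework-free sketch (the vocabulary, dimensions and random matrix are invented for illustration; in a real framework the matrix is learned during training):

```python
import numpy as np

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
embed_dim = 4
rng = np.random.default_rng(0)
# Random here for illustration; a real model learns these rows.
embedding_matrix = rng.normal(size=(len(vocab), embed_dim))

def embed(sentence: str) -> np.ndarray:
    """Map a sentence to a (num_tokens, embed_dim) array of vectors."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.lower().split()]
    return embedding_matrix[ids]

print(embed("The cat sat").shape)  # (3, 4)
```

Because the rows are learned from data, distances between them become meaningful, which is what the raw character encodings I originally asked about lack.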

Update on January 2023

After the recent announcements of ChatGPT, Large Language Models, etc., and since NLP has blown up, addressing my original question (processing strings of text) at the character level is now possible. The real question is why you would want to do that, which is an entirely different topic. For some information on how CNNs, RNNs, Transformers and other models can achieve this, see this blog post, which explains how character embeddings can be used. Similarly, other sources explain it in more detail, such as:
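
A character-level version of the embedding idea can be sketched as follows (the toy corpus is invented, and a real model would learn the embedding table rather than drawing it at random):

```python
import numpy as np

# Build a character vocabulary from a (toy) corpus, then map any string
# to a sequence of integer ids, i.e. rows of an embedding table.
corpus = "hello world"
char_to_id = {c: i for i, c in enumerate(sorted(set(corpus)))}

def encode_chars(text: str) -> list[int]:
    """Map known characters to their integer ids, skipping unknown ones."""
    return [char_to_id[c] for c in text if c in char_to_id]

rng = np.random.default_rng(1)
char_embeddings = rng.normal(size=(len(char_to_id), 3))  # learned in practice

ids = encode_chars("hello")
print(ids)                      # [3, 2, 4, 4, 5]
vectors = char_embeddings[ids]  # shape: (5, 3)
```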


Solution

  • I'll go ahead and summarize our discussion as the answer here.

    Your goal is to be able to incorporate text into your neural network. We have established that traditional ANNs are not really suitable for analyzing text. The underlying explanation is that ANNs operate on inputs that are generally a continuous range of values, where the nearness of two input values implies some sort of nearness in their meaning. Words have no such notion of nearness, so there is no real numerical encoding for words that can make sense as input to an ANN.
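
To make the nearness point concrete (the distances below are over raw UTF-8 byte values, nothing principled):

```python
# Byte encodings put "cat" numerically next to "car", yet semantically
# "cat" is far closer to "dog": distance in the encoding is meaningless.
def byte_distance(a: str, b: str) -> int:
    """Sum of absolute differences between corresponding UTF-8 bytes."""
    return sum(abs(x - y) for x, y in zip(a.encode(), b.encode()))

print(byte_distance("cat", "car"))  # 2  (numerically near, unrelated meaning)
print(byte_distance("cat", "dog"))  # 28 (numerically far, related meaning)
```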

    On the other hand, a solution that might work is to use a more traditional semantic analysis which could, perhaps, produce sentiment ranges for a list of topics; those topics and their sentiment values could then be used as input for an ANN.
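
A sketch of that pipeline's final step (the topic list and sentiment scores are invented; a real system would obtain them from the semantic-analysis stage):

```python
# A fixed list of topics defines the input layer; each slot holds that
# topic's sentiment in [-1, 1], with 0.0 meaning "topic not present".
TOPICS = ["price", "service", "quality"]

def to_ann_input(topic_sentiments: dict[str, float]) -> list[float]:
    """Build a fixed-width input vector from per-topic sentiment scores."""
    return [topic_sentiments.get(t, 0.0) for t in TOPICS]

# Hypothetical output of a semantic analyzer for one document:
analysis = {"price": -0.8, "quality": 0.5}
print(to_ann_input(analysis))  # [-0.8, 0.0, 0.5]
```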