How to encode the complex data for the neural network in the best way?

The data consist from the several records. A record is as the following: [bit vector, numeric vector, a few numeric values]. Bit vector has the different length for each record and the same is true for the numeric vector. The number of numeric values per the record is the constant for all the records.

The output is 2 numbers. Their values (both in the range [0.0, 1.0]) are used for evaluation/fitness function approximation in a search algorithm.

So, my question is: How to represent/normalize these data for the neural network? In particular, is there a (tricky) way to represent the bit vector compactly? Its length can be up to a few thousands.

Solution

Aside from few classical problems, there is no single right way to feed complex data into NN. It is sort of an art and in fact recent progress in deep learning owns a lot to advances in a ways of representing complex data.

Thus, without knowing the nature of your data it is difficult to give any specific advice. Why do you have variable length vectors? Are they representing sequences of some sort? What is encoded in a bit vector?

From pure technical point of view, variable length data means you need either padding with zeros to constant length (simplest but not usually good) or special NN architecture like convolutional or recurrent net and the choice will depend on nature of your dataset. If your bit vector represents a set of some kind of binary features, then you need one neuron per bit, or perhaps you can try to train compact real-valued embeddings with autoencoders.

To get more useful answer, please describe the nature of your problem and post your question to stats.stackexchange.com