CNTK Input Data Structure for example: CSTrainingCPUOnlyExamples

I am using the CNTK example LSTMSequenceClassifier via the console application CSTrainingCPUOnlyExamples, with the default data file Train.ctf. It looks like this:

The input layer has dimension 2000 (one-hot vector), and the output has 5 classes (softmax).

The File is loaded via:

MinibatchSource minibatchSource = MinibatchSource.TextFormatMinibatchSource(Path.Combine(DataFolder, "Train.ctf"), streamConfigurations, MinibatchSource.InfinitelyRepeat, true);

StreamInformation featureStreamInfo = minibatchSource.StreamInfo(featuresName);

StreamInformation labelStreamInfo = minibatchSource.StreamInfo(labelsName);

I would really appreciate an explanation of how the data file is generated and how the 2000 inputs map to the 5 output classes.

My goal is to write an application that formats data and saves it to a file that can be read as an input data file, so I need to understand the structure to make this work.

Thanks!

I see the Y dimension, and that part makes sense, but I am having trouble with the input layer.

Solution

Edit: @Frank Seide MSFT

I wonder if you can verify and give best practices:

private string Format(int sequenceId, string featureName, string featureShape, string labelName, string featureComment, string labelShape, string labelComment)
{
    return $"{sequenceId} |{featureName.Replace(" ","-")} {featureShape} |# {featureComment}   |{labelName.Replace(" ","-")} {labelShape} |# {labelComment}\r\n";
}

which might return something like:

0 |x 560:1 |# I am a comment   |y 1 0 0 0 0 |# I am a comment

Where:

sequenceId = 0;
featureName = "x";
featureShape = "560:1";
featureComment = "I am a comment";
labelName = "y";
labelShape = "1 0 0 0 0";
labelComment = "I am a comment";
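
To make the two value shapes above concrete, here is a minimal sketch of helpers that produce them. It assumes that "560:1" is a sparse one-hot entry (index 560 of the 2000-dimensional input) and that "1 0 0 0 0" is a dense one-hot label (class 0 of 5). The class name CtfValues and its method names are hypothetical and are not part of CNTK.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, not part of CNTK.
public static class CtfValues
{
    // Sparse one-hot: only the active position is written, as "index:value".
    public static string SparseOneHot(int activeIndex, int dimension)
    {
        if (activeIndex < 0 || activeIndex >= dimension)
            throw new ArgumentOutOfRangeException(nameof(activeIndex));
        return $"{activeIndex}:1";
    }

    // Dense one-hot: every position is written, separated by spaces.
    public static string DenseOneHot(int activeIndex, int dimension)
    {
        if (activeIndex < 0 || activeIndex >= dimension)
            throw new ArgumentOutOfRangeException(nameof(activeIndex));
        return string.Join(" ",
            Enumerable.Range(0, dimension).Select(i => i == activeIndex ? "1" : "0"));
    }

    // The CTF documentation says a pipe inside a comment body is escaped
    // by appending a hash symbol to it.
    public static string EscapeComment(string comment) => comment.Replace("|", "|#");
}
```

With these, CtfValues.SparseOneHot(560, 2000) could be passed as featureShape and CtfValues.DenseOneHot(0, 5) as labelShape in the Format method above.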

On GPU, Frank did suggest around 20 Sequences for each Minibatch, see: https://www.youtube.com/watch?v=TK671HxrufE @26:25

This is for custom C# Dataset formatting.

End edit...

By accident, I found an answer in the documentation:

BrainScript CNTK Text Format Reader using CNTKTextFormatReader

The document goes on to explain:

CNTKTextFormatReader (later simply CTF Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:

- Multiple input streams (inputs) per file
- Both sparse and dense inputs
- Variable length sequences

CNTK Text Format (CTF)

Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more sequence, input, sample relations. Each input line must be formatted as follows:

[Sequence_Id](Sample or Comment)+

where

Sample=|Input_Name (Value )*
Comment=|# some content

Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).

Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.

Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).

Each sample begins with a pipe symbol (|) followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.

Each value is either a number or an index-prefixed number for sparse inputs.

Both tabs and spaces can be used interchangeably as delimiters.

A comment starts with a pipe immediately followed by a hash symbol: |#, then followed by the actual content (body) of the comment. The body can contain any characters, however a pipe symbol inside the body needs to be escaped by appending the hash symbol to it (see the example below). The body of a comment continues until the end of line or the next un-escaped pipe, whichever comes first.
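
To check my reading of that grammar, here is a minimal sketch of a parser for a single CTF line. It is not CNTK's reader: the class name CtfLineParser is hypothetical, values are kept as raw strings, and comments are simply skipped. It assumes that "|#" inside a comment body is an escaped pipe and that any other pipe opens a new sample or comment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the line grammar, not CNTK's implementation.
public static class CtfLineParser
{
    public static (long? SequenceId, Dictionary<string, string[]> Samples) Parse(string line)
    {
        line = line.TrimEnd('\r', '\n');
        int firstPipe = line.IndexOf('|');
        if (firstPipe < 0) throw new FormatException("A line needs at least one sample.");

        // Optional sequence id in front of the first pipe.
        string head = line.Substring(0, firstPipe).Trim();
        long? sequenceId = head.Length == 0 ? (long?)null : long.Parse(head);

        var samples = new Dictionary<string, string[]>();
        int i = firstPipe;
        while (i < line.Length)
        {
            // line[i] is the pipe that opens this sample or comment.
            bool isComment = i + 1 < line.Length && line[i + 1] == '#';
            int start = i + 1;
            int end = start;
            while (end < line.Length)
            {
                if (line[end] == '|')
                {
                    bool escaped = isComment && end + 1 < line.Length && line[end + 1] == '#';
                    if (!escaped) break; // un-escaped pipe: the next field starts here
                    end++;               // step over the '#' of the escaped pipe
                }
                end++;
            }

            if (!isComment)
            {
                string[] tokens = line.Substring(start, end - start)
                    .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
                if (tokens.Length == 0) throw new FormatException("Missing input name.");
                var values = new string[tokens.Length - 1];
                Array.Copy(tokens, 1, values, 0, values.Length);
                samples.Add(tokens[0], values); // throws if an input name repeats on the line
            }
            i = end;
        }
        return (sequenceId, samples);
    }
}
```

For the line "0 |x 560:1 |# I am a comment   |y 1 0 0 0 0 |# I am a comment", this yields sequence id 0, input x with the single value "560:1", and input y with five values.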

Handy, and it answers the question.

The input data corresponding to the reader configuration above should look something like this:

|B 100:3 123:4 |C 8 |A 0 1 2 3 4 |# a CTF comment
|# another comment |A 0 1.1 22 0.3 54 |C 123917 |B 1134:1.911 13331:0.014
|C -0.001 |# a comment with an escaped pipe: '|#' |A 3.9 1.11 121.2 99.13 0.04 |B 999:0.001 918918:-9.19

Note the following about the input format:

- |Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the correspondent value vector.
- Dense vector is just a list of floating point values; sparse vector is a list of index:value tuples.
- Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).
- Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
- Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).
- The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs).
- Each well-formed line must end with either a "Line Feed" \n or "Carriage Return, Line Feed" \r\n symbols.
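
For the original question of how 2000 inputs map to 5 classes, here is a minimal sketch of writing one variable-length sequence. It assumes that lines sharing the same sequence id form one sequence, that each line carries one sparse one-hot token (an index into the 2000-entry vocabulary) under the input name x, and that the 5-class label y is written only once, on the first line of the sequence. The class name CtfSequenceWriter and its parameters are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration, not part of CNTK.
public static class CtfSequenceWriter
{
    // Emits one CTF line per token; all lines share the same sequence id.
    public static string FormatSequence(int sequenceId, IReadOnlyList<int> tokenIndices,
                                        int labelIndex, int numClasses)
    {
        if (tokenIndices.Count == 0)
            throw new ArgumentException("A sequence needs at least one token.");
        if (labelIndex < 0 || labelIndex >= numClasses)
            throw new ArgumentOutOfRangeException(nameof(labelIndex));

        var sb = new StringBuilder();
        for (int t = 0; t < tokenIndices.Count; t++)
        {
            // Sparse one-hot feature: "index:1".
            sb.Append(sequenceId).Append(" |x ").Append(tokenIndices[t]).Append(":1");
            if (t == 0)
            {
                // Dense one-hot label, written once per sequence (assumption).
                sb.Append(" |y");
                for (int c = 0; c < numClasses; c++)
                    sb.Append(c == labelIndex ? " 1" : " 0");
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

The returned text could be appended to a file with File.AppendAllText, one call per sequence, incrementing sequenceId each time.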

Some awesome content on the input and label data in these videos:

https://youtu.be/hMRrqkl77rI - @30:23
https://youtu.be/Vi05nEzAS8Y - @25:20

Also, helpful but not directly related: Read and feed data to CNTK Trainer