
How can I access the comment columns when using CNTKTextFormat reader?


I can't figure out how to access the comment columns in my data files that are in CNTKTextFormat. For example, in this tutorial, you have the following:

19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O
19  |S0 770:1 |# show                         |S2 128:1 |# O
19  |S0 429:1 |# flights                      |S2 128:1 |# O

How can I access the commented data?


Solution

  • If you instantiate your minibatch source like this:

    data_source = MinibatchSource(CTFDeserializer("mydata.ctf", ...), randomize=False, ...)
    

    you can then open the same input file you passed to CTFDeserializer in Python and parse it minibatch by minibatch. It is very important to set randomize=False; otherwise the reader and your manual parsing below will not stay in sync. For example, if the file object is stream and the minibatch size is batch_size, the following code prints each sequence's comment columns as a dictionary that maps the preceding column name (S0, S1, or S2) to the list of strings found in the comments that follow that column.

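    from itertools import groupby
    from collections import defaultdict

    batch_size = 3  # example value; use the minibatch size you pass to the reader

    stream = open("mydata.ctf")

    # Read the lines of the next minibatch; skip the '' that readline() returns at end of file.
    lines = [line for line in (stream.readline() for _ in range(batch_size)) if line.strip()]
    # Consecutive lines that share a sequence id (the first field) form one sequence.
    for seqid, sequence in groupby(lines, lambda s: s.split()[0]):
        mapping = defaultdict(list)
        for sample in sequence:
            parts = sample.split('|')
            for i, p in enumerate(parts):
                # A comment column starts with '#'; file it under the preceding column's name.
                if p.startswith('#'):
                    mapping[parts[i-1].split(' ')[0].strip()].append(p.strip())
        print(seqid, mapping)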
    

    For the example input above, it prints (the order of the keys may vary with the Python version):

    19 defaultdict(<class 'list'>, {'S0': ['# BOS', '# show', '# flights'], 'S2': ['# O', '# O', '# O'], 'S1': ['# flight']})
    

    This example will work for the above input format. If your actual format is different you will have to adapt this for your purposes.
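
As a starting point for such an adaptation, the same parsing can be wrapped in a small generator. This is only a sketch: the name comments_by_sequence is made up for illustration and is not part of CNTK, and it assumes that every line starts with a sequence id and that every comment column directly follows a named column. Unlike the snippet above, it drops the leading '#' from each comment and returns plain dicts.

```python
from collections import defaultdict
from itertools import groupby


def comments_by_sequence(lines):
    """Yield (sequence_id, {column_name: [comment, ...]}) for consecutive CTF lines.

    Hypothetical helper, not a CNTK API.
    """
    # Drop blank lines so the sequence-id lookup below never sees an empty string.
    rows = (line for line in lines if line.strip())
    # Consecutive lines with the same first field belong to one sequence.
    for seq_id, samples in groupby(rows, key=lambda line: line.split()[0]):
        comments = defaultdict(list)
        for sample in samples:
            parts = sample.split('|')
            # Pair every column with the column that precedes it.
            for prev, part in zip(parts, parts[1:]):
                if part.startswith('#'):
                    # prev looks like "S0 178:1 "; its first token is the column name.
                    name = prev.split()[0]
                    # Strip the leading '#' and the surrounding whitespace.
                    comments[name].append(part[1:].strip())
        yield seq_id, dict(comments)


sample_lines = [
    "19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O",
    "19  |S0 770:1 |# show                         |S2 128:1 |# O",
    "19  |S0 429:1 |# flights                      |S2 128:1 |# O",
]
parsed = list(comments_by_sequence(sample_lines))
print(parsed)
```

To process one minibatch at a time, you could pass it itertools.islice(stream, batch_size) instead of a list, again with randomize=False so that the file position and the reader stay aligned.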