
Getting encoded output when I print Hindi text from a TensorFlow dataset


I'm using this corpus for an NLP task. When I read the files and store the Hindi and English lines in separate lists, I get ordinary, readable Python strings:

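def extract_lines(fp):
    # Open explicitly as UTF-8 so the result does not depend on the platform's default encoding
    with open(fp, encoding="utf-8") as f:
        return [line.strip() for line in f]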

inp,target = extract_lines(train_hi),extract_lines(train_en)

sample: ['अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें', 'एक्सेर्साइसर पहुंचनीयता अन्वेषक'] ['Give your application an accessibility workout', 'Accerciser Accessibility Explorer']

I then create a tensorflow dataset using the two lists:

buffer_size = len(inp)
batch_size = 64

dataset = tf.data.Dataset.from_tensor_slices((inp,target)).shuffle(buffer_size)
dataset = dataset.batch(batch_size)

The output I get from

for input_sample,target_sample in dataset.take(1):
    print(input_sample)

is something like:

tf.Tensor( [b'\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa5\x8b\xe0\xa4\x82\xe0\xa4\x95\xe0\xa5\x80 \xe0\xa4\x95\xe0\xa5\x8b\xe0\xa4\x9f\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x81'

I'm pretty new to dealing with text data, especially in TensorFlow. What is happening here?


Solution

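  • TensorFlow string tensors hold byte strings, so Python Unicode strings such as the Hindi text are encoded as UTF-8 when they are converted to tensors. Printing the tensor shows those raw bytes, which is why each non-ASCII character appears as a run of \x.. escapes while plain ASCII text such as the English lines looks unchanged. Check this guide for more details. If you want to view your data, you can decode the encoded string tensors like this: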

    import tensorflow as tf
    
    def extract_lines(fp):
        return [line.strip() for line in fp]
    
    inp,target = extract_lines(['अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें', 'एक्सेर्साइसर पहुंचनीयता अन्वेषक'] ),extract_lines(['Give your application an accessibility workout', 'Accerciser Accessibility Explorer'])
    
    buffer_size = len(inp)
    batch_size = 1
    
    dataset = tf.data.Dataset.from_tensor_slices((inp,target)).shuffle(buffer_size)
    dataset = dataset.batch(batch_size)
    
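    for x, y in dataset:
        # unicode_decode yields the Unicode code points of each string; chr() maps each code point back to a character
        print("".join([chr(i) for i in tf.strings.unicode_decode(x, 'utf-8').to_tensor()[0]]), y)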
    
    Output (the order may vary because the dataset is shuffled):

    एक्सेर्साइसर पहुंचनीयता अन्वेषक tf.Tensor([b'Accerciser Accessibility Explorer'], shape=(1,), dtype=string)
    अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें tf.Tensor([b'Give your application an accessibility workout'], shape=(1,), dtype=string)
    

    Note that the Hindi text is UTF-8 encoded as soon as it is converted to TensorFlow tensors, so it has to be decoded whenever you want to display it as readable text.
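
To see why the printed tensor looks the way it does, the same round trip can be reproduced in plain Python without TensorFlow. This is only a sketch: `decode_batch` is a hypothetical helper name, and the list of bytes objects passed to it stands in for what `input_sample.numpy()` would be expected to return for a string tensor.

```python
# Pure-Python sketch of the UTF-8 round trip; no TensorFlow required.
hindi = "अपने अनुप्रयोग"

# Encoding to UTF-8 yields a bytes object. Its repr shows every
# non-ASCII byte as a \x.. escape, just like the printed tensor.
encoded = hindi.encode("utf-8")
print(encoded)

# Decoding the same bytes restores the original text exactly.
print(encoded.decode("utf-8"))


def decode_batch(byte_strings):
    """Decode an iterable of UTF-8 bytes objects into Python strings.

    Hypothetical helper: the argument is assumed to be a sequence of
    bytes objects, e.g. the array obtained from a string tensor.
    """
    return [b.decode("utf-8") for b in byte_strings]


# Stand-in for one batch of encoded strings.
batch = [s.encode("utf-8") for s in ["एक्सेर्साइसर पहुंचनीयता अन्वेषक", "अपने"]]
print(decode_batch(batch))
```

The data itself is intact. Only the printed representation differs, so decoding is a display step and does not change what is stored in the dataset.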