Remove quotation marks from .txt


I have a .txt file with rows like the following:

"Hello I'm in Tensorflow"
"My name is foo"
'Mr "alias" is running'
...

As can be seen, there is just one string per row. When I try to create a tf.data.Dataset, the output looks like this:

import tensorflow as tf

conver = tf.data.TextLineDataset('path_to.txt')
for utter in conver:
    print(utter)
    break
# tf.Tensor(b'"Hello I'm in Tensorflow"', shape=(), dtype=string)

Notice that the quotation mark " is still present at the beginning and end of the string (in addition to the ' that delimits the bytes value in the tensor's printed form). My desired output would be:

# tf.Tensor(b'Hello I'm in Tensorflow', shape=(), dtype=string)

That is, without the quotation marks. Thank you in advance


Solution

  • You could use tf.strings.regex_replace:

    import tensorflow as tf

    conver = tf.data.TextLineDataset('/content/text.txt')

    def remove_quotes(text):
        # Delete every double quote, then every single quote, anywhere in the line.
        text = tf.strings.regex_replace(text, '"', '')
        text = tf.strings.regex_replace(text, "'", '')
        return text

    conver = conver.map(remove_quotes)
    for s in conver:
        print(s)

    # tf.Tensor(b'Hello Im in Tensorflow', shape=(), dtype=string)
    # tf.Tensor(b'My name is foo', shape=(), dtype=string)
    # tf.Tensor(b'Mr alias is running', shape=(), dtype=string)
    

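    Note that this removes every quote character, including the apostrophe in I'm and the quotes around alias. If you only want to remove the leading and trailing quotes, try this instead: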

    text = tf.strings.regex_replace(text, '^[\"\']*|[\"\']*$', '')
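
To see the edge-only pattern working inside the full pipeline, here is a minimal end-to-end sketch. It assumes the three sample rows from the question; the temporary file and the function name strip_outer_quotes are only illustrative stand-ins, not part of the original code.

```python
import os
import tempfile

import tensorflow as tf

# Sample rows copied from the question; the temporary file stands in for the real .txt.
rows = ['"Hello I\'m in Tensorflow"', '"My name is foo"', '\'Mr "alias" is running\'']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(rows) + '\n')
    path = f.name

def strip_outer_quotes(text):
    # Remove runs of " or ' only at the very start or very end of the line.
    return tf.strings.regex_replace(text, '^["\']*|["\']*$', '')

conver = tf.data.TextLineDataset(path).map(strip_outer_quotes)

# Decode each scalar string tensor to a Python str for easier inspection.
cleaned = [s.numpy().decode('utf-8') for s in conver]
os.remove(path)

for line in cleaned:
    print(line)
# Hello I'm in Tensorflow
# My name is foo
# Mr "alias" is running
```

Unlike the first version, the apostrophe in I'm and the inner quotes around alias are kept, because the pattern is anchored to the start and end of each line.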