Tags: tensorflow, text, nlp, training-data, data-augmentation

Textual Data Augmentation in TensorFlow


I'm doing sentiment analysis on the IMDB dataset in TensorFlow, and I'm trying to augment the training set with the textaugment library, which is described as 'plug and play' with TensorFlow. So it should be fairly simple, but I'm new to TF and not sure how to go about it. Here is what I have and what I am trying, based on the tutorials on the site.

I tried to use map to augment the training data, but I got an error. The error is shown in the last code block below.

pip install -q tensorflow-text
pip install -q tf-models-official
import os
import shutil
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization # to create AdamW Optimizer
import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

# Download the IMDB dataset and build the train/validation/test sets

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)


AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)


# Set up textaugment
try:
  import textaugment
except ModuleNotFoundError:
  !pip install textaugment
  import textaugment
from textaugment import EDA
import nltk
nltk.download('stopwords')

# EDA augmenter instance; this is the `t` used in the map call below
t = EDA()

This is where I get the error. I tried to map over train_ds and apply a random swap to each element while keeping the label unchanged:

aug_ds = train_ds.map(
    lambda x, y: (t.random_swap(x), y))

Error Message:

AttributeError                            Traceback (most recent call last)
<ipython-input-24-b4af68cc0677> in <module>()
      1 aug_ds = train_ds.map(
----> 2     lambda x, y: (t.random_swap(x), y))

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    668       except Exception as e:  # pylint:disable=broad-except
    669         if hasattr(e, 'ag_error_metadata'):
--> 670           raise e.ag_error_metadata.to_exception(e)
    671         else:
    672           raise

AttributeError: in user code:

    <ipython-input-24-b4af68cc0677>:2 None  *
        lambda x, y: (t.random_swap(x), y))
    /usr/local/lib/python3.6/dist-packages/textaugment/eda.py:187 random_swap  *
        self.validate(sentence=sentence, n=n)
    /usr/local/lib/python3.6/dist-packages/textaugment/eda.py:74 validate  *
        if not isinstance(kwargs['sentence'].strip(), str) or len(kwargs['sentence'].strip()) == 0:

    AttributeError: 'Tensor' object has no attribute 'strip'

Solution

  • I am trying to do the same thing. The error occurs because textaugment's t.random_swap() expects a Python string (str) object.

    In your code, the function receives a Tensor with dtype=string instead. Tensor objects do not have Python string methods such as strip(), hence the AttributeError. Note also that train_ds is batched, so x holds a batch of reviews rather than a single sentence.

    N.B. tensorflow_text has some additional APIs for working with string tensors, but at the moment they are limited to things like tokenization and checking for upper or lower case. A long-winded workaround is to wrap the call in tf.py_function, but this reduces performance because the wrapped code runs in the Python interpreter rather than in the TensorFlow graph. In the end I opted not to use textaugment for my use case. Hope this helps.

    N.B. 2: The tf.strings APIs offer a bit more functionality, such as regex replacement, but not enough for the kind of augmentation you are after. It would be helpful to see what others come up with, or whether future updates to TF or textaugment address this.
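For anyone who wants to try the tf.py_function workaround mentioned above, here is a minimal sketch, not a tested recipe for textaugment itself. The names swap_words, augment_batch and add_augmentation are hypothetical helpers made up for this illustration. swap_words is a simple stand-in for a string-only augmenter; in principle something like t.random_swap could be passed as the augment argument instead, but I have not verified that. The sketch assumes a batched (text, label) dataset like train_ds from the question. TensorFlow is imported inside add_augmentation only so that the pure-Python helpers can be tried on their own.

```python
import random


def swap_words(sentence, n=1, rng=random):
    """Stand-in for a string-only augmenter: swap two random words, n times."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment_batch(byte_strings, augment=swap_words):
    """Decode a batch of UTF-8 byte strings and augment each as a Python str."""
    return [augment(b.decode("utf-8", errors="replace")) for b in byte_strings]


def add_augmentation(ds, augment=swap_words):
    """Apply `augment` to the text of a batched (text, label) tf.data.Dataset."""
    import tensorflow as tf  # imported here so the helpers above work without TF

    def _py_augment(texts):
        # Inside tf.py_function `texts` is an eager tensor, so .numpy() is
        # available and yields an array of Python bytes objects.
        return tf.constant(augment_batch(texts.numpy(), augment), dtype=tf.string)

    def _map_fn(texts, labels):
        out = tf.py_function(_py_augment, inp=[texts], Tout=tf.string)
        out.set_shape(texts.shape)  # py_function loses the static shape
        return out, labels

    return ds.map(_map_fn, num_parallel_calls=tf.data.AUTOTUNE)
```

Usage would then look like aug_ds = add_augmentation(train_ds). Because train_ds in the question is cached before the map is applied, the augmentation should be re-drawn on every epoch rather than frozen into the cache. The performance caveat from the answer still applies, since the Python code runs outside the TensorFlow graph.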