python tensorflow tensorflow-datasets

Best way to process texts and decode videos while loading the data in Tensorflow


I have a dataframe with a Text column and a Media_location column.

I'm building a model that takes text and video as input. My aim is to load Text and Media_location (which contains video file paths) from the dataframe so that I can iterate over each df['Text'] entry together with the video loaded from the corresponding df['Media_location'] path.

I couldn't find any implementations in TensorFlow that do this sort of thing, so any suggestions are welcome.


Solution

  • You can try using tensorflow-io, which will run in graph mode. Just run pip install tensorflow-io and then try:

    import tensorflow as tf
    import tensorflow_io as tfio
    import pandas as pd
    
    df = pd.DataFrame(data={'Text': ['some text', 'some more text'],
                            'Media_location': ['/content/sample-mp4-file.mp4', '/content/sample-mp4-file.mp4']})
    
    dataset = tf.data.Dataset.from_tensor_slices((df['Text'], df['Media_location']))
    
    def decode_videos(x, y):
      video = tf.io.read_file(y)
      video = tfio.experimental.ffmpeg.decode_video(video)
      return x, video
    
    dataset = dataset.map(decode_videos)
    
    for x, y in dataset:
      print(x, y.shape)
    
    # Output:
    # tf.Tensor(b'some text', shape=(), dtype=string) (901, 270, 480, 3)
    # tf.Tensor(b'some more text', shape=(), dtype=string) (901, 270, 480, 3)
    

    In this example, each video contains 901 frames.
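Real clips rarely all have the same number of frames, and a plain batch() call fails once the lengths differ. Below is a minimal sketch of one way to batch them, assuming that zero-padding the frame axis is acceptable for your model. The tiny 2x2 blank frames and the generator `gen` are hypothetical stand-ins for real decoded videos:

```python
import numpy as np
import tensorflow as tf

def gen():
    # Hypothetical stand-ins for decoded clips: 3 and 5 blank 2x2 RGB frames.
    yield "some text", np.zeros((3, 2, 2, 3), dtype=np.uint8)
    yield "some more text", np.zeros((5, 2, 2, 3), dtype=np.uint8)

clips = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),
        # The frame axis is None because clip lengths differ.
        tf.TensorSpec(shape=(None, 2, 2, 3), dtype=tf.uint8),
    ),
)

# padded_batch pads each unknown dimension to the longest element in the batch.
batched = clips.padded_batch(2)

for texts, videos in batched:
    print(texts.shape, videos.shape)
```

If padding is not suitable, sampling a fixed number of frames per clip before batching is another option.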

    If you are a Windows user, you can try using cv2 like this:

    import tensorflow as tf
    import pandas as pd
    import cv2
    import numpy as np
    
    df = pd.DataFrame(data={'Text': ['some text', 'some more text'],
                            'Media_location': ['/content/sample-mp4-file.mp4', '/content/sample-mp4-file.mp4']})
    
    dataset = tf.data.Dataset.from_tensor_slices((df['Text'], df['Media_location']))
    
    
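    def get_video_asarray(path):
      # path arrives as a scalar string tensor; convert it to a Python str for OpenCV
      cap = cv2.VideoCapture(path.numpy().decode("utf-8"))
      frames = []
      while True:
        read, img = cap.read()  # img is in BGR channel order
        if not read:
          break
        frames.append(img)
      cap.release()  # free the underlying file handle
      return np.stack(frames, axis=0)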
    
    
    def decode_videos(x, y):
      y = tf.py_function(get_video_asarray, [y], Tout=[tf.float32])
      return x, tf.squeeze(y, axis=0)
    
    dataset = dataset.map(decode_videos)
    
    for x, y in dataset:
      print(x, y.shape)
    
    # Output:
    # tf.Tensor(b'some text', shape=(), dtype=string) (901, 270, 480, 3)
    # tf.Tensor(b'some more text', shape=(), dtype=string) (901, 270, 480, 3)
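
Tensors returned by tf.py_function carry no static shape, which can trip up later stages such as batching or Keras layers that need a known rank. Below is a minimal sketch of restoring the known dimensions with set_shape, assuming every video has the 270x480 RGB frame size from the output above. `fake_decode` is a hypothetical stand-in for get_video_asarray so the snippet runs without a video file:

```python
import numpy as np
import tensorflow as tf

# Assumed frame size, taken from the example output; adjust for your videos.
HEIGHT, WIDTH, CHANNELS = 270, 480, 3

def fake_decode(path):
    # Hypothetical stand-in for get_video_asarray: returns 4 blank frames
    # instead of reading the file at `path`.
    return np.zeros((4, HEIGHT, WIDTH, CHANNELS), dtype=np.float32)

def decode_with_shape(text, path):
    video = tf.py_function(fake_decode, [path], Tout=tf.float32)
    # Restore what is known statically; the frame count stays unknown (None).
    video.set_shape([None, HEIGHT, WIDTH, CHANNELS])
    return text, video

pairs = tf.data.Dataset.from_tensor_slices(
    (["some text", "some more text"], ["a.mp4", "b.mp4"])
)
shaped = pairs.map(decode_with_shape)

print(shaped.element_spec[1].shape)
```

With the rank and frame size declared, the dataset's element_spec describes the video tensor well enough for shape-dependent steps further down the pipeline.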