Tags: python, tensorflow, dataset, tensorflow-datasets, sampling

TensorFlow dataset: how to form an online distribution


I am trying to build a histogram of my dataset online, while the samples "x" are being generated, so that I can use it to steer the distribution of the samples "y". Here is a toy example, which does not work:

import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.random(seed=4).take(10).map(lambda x: x % 10)
hs = tf.convert_to_tensor(np.zeros(10), tf.float32)  # the histogram
dataset = dataset.map(lambda x: proc(x, hs))

where the proc function is:

def proc(x, hs):
  y = tf.math.argmin(input=hs)
  hs = tf.tensor_scatter_nd_add(hs, [[x]], [1])  # hist[x] += 1
  return x, y

As you might expect, the variable "hs" never changes (the function just binds a new object to its local name hs). Is there any way I could make this work? (I have looked at rejection sampling for datasets, but I don't want to create samples that might later be discarded; I would like to maintain an online distribution and generate samples accordingly.)

More information: the real generator for "x" does not produce a uniform distribution (unlike this example). The goal of the histogram is to help me generate the "y" samples by filling in the lowest-frequency bins of x, so that in the end the distribution of {x, y} mimics a uniform distribution.


Solution

  • If hs should be part of your dataset, your code works as written; each element is paired with its own histogram, computed from the initial all-zero hs:

    import tensorflow as tf
    import numpy as np
    
    def proc(x, hs):  
      x = (x+1)%10
      hs = tf.tensor_scatter_nd_add(hs, [[x]], [1])  # hist[x] += 1
      return x, hs
    dataset = tf.data.Dataset.random(seed=4).take(10).map(lambda x: x%10)
    hs = tf.convert_to_tensor(np.zeros(10), tf.float32) 
    dataset = dataset.map(lambda x : proc(x,hs))
    
    for x, y in dataset:
      print(x, y)
    
    tf.Tensor(6, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 0. 1. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(0, shape=(), dtype=int64) tf.Tensor([1. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(9, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 1.], shape=(10,), dtype=float32)
    tf.Tensor(4, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 1. 0. 0. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(6, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 0. 1. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(4, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 1. 0. 0. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(4, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 1. 0. 0. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(5, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 1. 0. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(6, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 0. 1. 0. 0. 0.], shape=(10,), dtype=float32)
    tf.Tensor(7, shape=(), dtype=int64) tf.Tensor([0. 0. 0. 0. 0. 0. 0. 1. 0. 0.], shape=(10,), dtype=float32)
    

    If you want hs as a separate tensor, you could additionally run:

    hs = tf.convert_to_tensor(list(dataset.map(lambda x, y: y)))
    print(hs)
    
    tf.Tensor(
    [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
     [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
     [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]], shape=(10, 10), dtype=float32)
    

    If you just want to update the tensor hs based on your dataset, try:

    import tensorflow as tf
    import numpy as np
    
    def proc(ds, hs):
      for x in ds:
        x = (x+1)%10
        hs = tf.tensor_scatter_nd_add(hs, [[x]], [1])  # hist[x] += 1
      return hs
    dataset = tf.data.Dataset.random(seed=4).take(10).map(lambda x: x%10)
    hs = tf.convert_to_tensor(np.zeros(10), tf.float32) 
    hs = proc(dataset, hs)
    print(hs)
    
    tf.Tensor([1. 0. 0. 0. 3. 1. 3. 1. 0. 1.], shape=(10,), dtype=float32)
    

    Update 1: To keep a running histogram across calls to map, store hs in a tf.Variable and update it in place with assign:

    import tensorflow as tf
    
    def proc(x, hs):
      y = tf.math.argmin(input=hs)  # least-filled bin so far
      hs.assign(tf.tensor_scatter_nd_add(hs.value(), [[x]], [1]))  # hist[x] += 1
      return x, y
    
    dataset = tf.data.Dataset.random(seed=4).take(10).map(lambda x: x % 10)
    hs = tf.Variable(tf.zeros(10, dtype=tf.int64))  # the histogram
    dataset = dataset.map(lambda x: proc(x, hs))
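
The behavior the question describes (pick y from the least-filled bin while the histogram is updated online) can also be sketched in plain Python, which may make the intended logic easier to check before moving it into tf.data. This is only a rough sketch: the name balanced_pairs is hypothetical, and counting y in the histogram as well as x is an assumption based on the stated goal that {x, y} together look uniform; the code in the question counts only x.

```python
def balanced_pairs(xs, num_bins=10):
    """Yield (x, y) pairs, where y is the least-filled bin seen so far.

    Assumption: both x and y are counted in the running histogram.
    """
    hist = [0] * num_bins  # running histogram, kept across iterations
    for x in xs:
        # Index of the lowest count; ties resolve to the smallest index,
        # like tf.math.argmin.
        y = min(range(num_bins), key=hist.__getitem__)
        hist[x] += 1  # count the incoming sample
        hist[y] += 1  # count the generated sample (assumption, see above)
        yield x, y

# A source stuck on bin 0: the y samples fill the other bins in turn.
pairs = list(balanced_pairs([0, 0, 0], num_bins=3))
print(pairs)  # [(0, 0), (0, 1), (0, 2)]
```

Because the histogram lives inside the generator, the state persists between samples without any variable capture. If the result has to be a tf.data pipeline, one option might be to wrap such a generator with tf.data.Dataset.from_generator.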