
TensorFlow image_dataset_from_directory: get specific pictures from the dataset by a list of indexes


I have a folder called train_ds (don't be confused by the name, it is just a folder with pictures) that contains 5 subfolders of pictures. Each subfolder is a different class.

I'm running 5 different trained models over this train_ds folder to get their inferences. What I want is to find explicitly which pictures all the models fail to classify correctly. For that, I:

  • Use the tf method image_dataset_from_directory to load the pictures.
  • Use the function inferences_target_list to get a list of inferred labels and a list of real labels. Both lists have the same length.
  • Use the function get_missclassified to get a list of the indexes where the inference and the real label differ. Voila, I have the mismatched ones for one model.
  • Run the same for the 5 trained models.
  • Get the indexes common to the 5 runs.

So, in effect, I have indexed all the images in the train_ds folder and, from those, I have the indexes of the images that every model classifies wrong.

The question now is: how do I get the pictures associated with those indexes from the image_dataset_from_directory method?

Functions:

import numpy as np
import tensorflow as tf


def inferences_target_list(model, data):
    '''
    returns 2 lists: inferences list, real labels
    '''
    # over train set fold1
    y_pred_float = model.predict(data)
    y_pred = np.argmax(y_pred_float, axis=1)

    # get real labels (the dataset must not be shuffled, so the order matches y_pred)
    y_target = tf.concat([y for x, y in data], axis=0)
    print("length inferences and real labels: ", len(y_pred), len(y_target))
    return y_pred, y_target


def get_missclassified(y_pred, y_target):
  '''
  returns a list with the indexes of the labels that were misclassified
  '''
  missclassified = []
  for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
    if pred!=target:
      #print(i, pred, target)
      missclassified.append(i)
  print("total missclassified: ",len(missclassified))
  return missclassified
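
The comparison step can also be written without the explicit loop. This is only a minimal sketch, assuming both inputs have already been converted to plain Python lists of integer labels (for example with y_pred.tolist() and y_target.numpy().tolist()); the name misclassified_indexes and the example values are hypothetical, not part of the original code.

```python
def misclassified_indexes(y_pred, y_target):
    """Return the positions where the prediction and the real label differ."""
    # Assumption: both arguments are plain sequences of ints of equal length.
    if len(y_pred) != len(y_target):
        raise ValueError("predictions and labels must have the same length")
    return [i for i, (pred, target) in enumerate(zip(y_pred, y_target)) if pred != target]


# Made-up labels, for illustration only.
example_pred = [0, 2, 1, 4, 3]
example_true = [0, 1, 1, 4, 0]
wrong = misclassified_indexes(example_pred, example_true)
print(wrong)
```

The length check is there because zip silently stops at the shorter input, which would hide a mismatch between predictions and labels.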

Method:

missclassified_train_folders=[]

for f in folders: # at the moment just 1 folder 
  print(f)
  for nn in models_dict: # dictionary of trained models
    print(nn)

    # -- train dataset for each folder
    train_path = reg_input+f+"/"+'train_ds/'
    # print("\n train dataset:", "\n", train_path)
    train_ds = image_dataset_from_directory(
        train_path,
        class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
        seed=None,
        validation_split=None, 
        subset=None,
        image_size= image_size,
        batch_size= batch_size,
        color_mode='rgb',
        shuffle=False 
        )
    
    # inferences and real values
    y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
    
    # missclassified ones
    missclassified = get_missclassified(y_pred, y_target)
    print("elements missclassified in {} for model {}: ".format(f, nn), len(missclassified))
    missclassified_train_folders.append(missclassified)

I got the list of indexes, but I don't know how to apply it.

Thanks in advance! | (• ◡•)| (❍ᴥ❍ʋ)


Solution

  • The one given by @ma7555 was the simple solution I was looking for. Nevertheless, the list of labels I get with the ma7555 method is different from the one I get using tf.concat([y for x, y in train_ds], axis=0).

    train_ds is created using the image_dataset_from_directory method, and the folder has 5 subfolders inside (my classes). The clumsy solution I have at the moment is:

    • get the list of inferred labels and the real ones with inferences_target_list
    • compare the 2 lists, check which labels are different and store their indexes with get_misclassified
    • get the list of files in the folders with get_list_of_files. This should be the same as the paths from ma7555's method. I haven't checked yet whether the order is the same.
    import os

    import numpy as np
    import tensorflow as tf


    def inferences_target_list(model, data):
        '''
        returns 2 lists: inferences list, real labels
        '''
        # over train set fold1
        y_pred_float = model.predict(data)
        y_pred = np.argmax(y_pred_float, axis=1)
    
        # get real labels (the dataset must not be shuffled, so the order matches y_pred)
        y_target = tf.concat([y for x, y in data], axis=0)
        print("length inferences and real labels: ", len(y_pred), len(y_target))
        return y_pred, y_target
    
    
    def get_misclassified(y_pred, y_target):
      '''
      returns a list with the indexes of the labels that were misclassified
      '''
      misclassified = []
      for i, (pred, target) in enumerate(zip(y_pred, y_target.numpy().tolist())):
        if pred!=target:
          #print(i, pred, target)
          misclassified.append(i)
      print("total misclassified: ",len(misclassified))
      return misclassified
    
    def get_list_of_files(dirName):
        '''
        create a list of file and sub directories names in the given directory
        found here => https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/
        ''' 
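        # os.listdir gives no ordering guarantee, so sort to get a repeatable order.
        # Note: this is still not guaranteed to match the order used by image_dataset_from_directory.
        listOfFile = sorted(os.listdir(dirName))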
        allFiles = list()
        # Iterate over all the entries
        for entry in listOfFile:
            # Create full path
            fullPath = os.path.join(dirName, entry)
            # If entry is a directory then get the list of files in this directory 
            if os.path.isdir(fullPath):
                allFiles = allFiles + get_list_of_files(fullPath)
            else:
                allFiles.append(fullPath)
                    
        return allFiles
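
Since the order of pic_list has to match the order of the dataset for the indexes to make sense, here is a hedged sketch of a listing that walks the class folders in the same order as class_names and sorts the files inside each class. It assumes that, with shuffle=False, image_dataset_from_directory reads the classes in class_names order and the files in sorted order, and that it only keeps common image extensions. I have not confirmed either assumption here, so the result should be compared against the dataset before trusting it. The names list_images_in_class_order and IMAGE_EXTENSIONS are hypothetical.

```python
import os
import tempfile

# Assumption: only these extensions are picked up as images.
IMAGE_EXTENSIONS = (".bmp", ".gif", ".jpeg", ".jpg", ".png")


def list_images_in_class_order(root_dir, class_names):
    """List image paths class by class, with files sorted inside each class."""
    paths = []
    for class_name in class_names:
        class_dir = os.path.join(root_dir, class_name)
        for current_dir, subdirs, files in os.walk(class_dir):
            subdirs.sort()  # sorting in place makes os.walk descend in a fixed order
            for fname in sorted(files):
                if fname.lower().endswith(IMAGE_EXTENSIONS):
                    paths.append(os.path.join(current_dir, fname))
    return paths


# Small demo on a temporary folder tree with made-up file names.
with tempfile.TemporaryDirectory() as root:
    layout = {"Kitchen": ["b.png", "a.jpg"], "Bedroom": ["z.png", "notes.txt"]}
    for class_name, file_names in layout.items():
        os.makedirs(os.path.join(root, class_name))
        for file_name in file_names:
            open(os.path.join(root, class_name, file_name), "w").close()
    demo_paths = list_images_in_class_order(root, ["Bedroom", "Kitchen"])
    demo_names = [os.path.relpath(p, root).replace(os.sep, "/") for p in demo_paths]

print(demo_names)
```

A quick sanity check is to compare the length of this list with the number of labels returned by inferences_target_list; if they differ, the listing and the dataset are not seeing the same files.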
    

    Start

    misclassified_train_folders=[]
    
    for f in folders:
      print(f)
      for nn in models_dict:
        #print(nn)
    
        # -- train dataset for each folder
        train_path = reg_input+f+"/"+'train_ds/'
        # print("\n train dataset:", "\n", train_path)
        train_ds = image_dataset_from_directory(
            train_path,
            class_names=["Bedroom","Bathroom","Dinning","Livingroom","Kitchen"],
            seed=None,
            validation_split=None, 
            subset=None,
            image_size= image_size,
            batch_size= batch_size,
            color_mode='rgb',
            shuffle=False 
            )
        
        # list of paths for analysed images
        pic_list = get_list_of_files(train_path)
        
        # inferences and real values
        y_pred, y_target = inferences_target_list(models_dict[nn], train_ds)
        
        # misclassified ones
        misclassified = get_misclassified(y_pred, y_target)
        print("elements misclassified in {} for model {}: ".format(f, nn), len(misclassified))
        misclassified_train_folders.append(misclassified)
    
    
    • Now I have a list with 5 lists inside: each one holds the indexes of the elements misclassified by one model in my first folder. To get the pictures that are always misclassified:
    common_misclassified = list(set.intersection(*map(set, misclassified_train_folders)))
    # these are the indexes of those images
    print(len(common_misclassified), "\n", common_misclassified)
    
    • To get the paths of those pics:
    pic_list_misclassified = [pic_list[i] for i in common_misclassified]
    
    # number of pictures misclassified by all models
    print(len(pic_list_misclassified))
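
The last two steps can be sketched together as below. This is an illustration under assumptions, not a tested part of my pipeline: the helper names common_indexes and select_paths and the example data are made up. In the TensorFlow versions I am aware of, the dataset returned by image_dataset_from_directory carries a file_paths attribute, which with shuffle=False should follow the same order as the labels; treat that as something to verify on your version before using it in place of pic_list.

```python
def common_indexes(index_lists):
    """Return the indexes present in every list, in ascending order."""
    if not index_lists:
        return []
    return sorted(set.intersection(*map(set, index_lists)))


def select_paths(paths, indexes):
    """Return the paths at the given positions, failing loudly on a bad index."""
    for i in indexes:
        if not 0 <= i < len(paths):
            raise IndexError(f"index {i} is outside the list of {len(paths)} paths")
    return [paths[i] for i in indexes]


# Made-up data for illustration: three models, ten images.
per_model_errors = [[1, 4, 7], [4, 7, 9], [0, 4, 7]]
example_paths = [f"img_{i}.jpg" for i in range(10)]  # stand-in for train_ds.file_paths or pic_list

always_wrong = common_indexes(per_model_errors)
print(always_wrong)
print(select_paths(example_paths, always_wrong))
```

Sorting the intersection is a small design choice: a set has no defined order, so sorting makes the output repeatable between runs.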