Tags: python, machine-learning, dataset, histogram

How could I find the number of instances per fold, in my dataset?


I have been given a .npz file containing the data. I have explored the dataset and noted that it has 5 entries:

import numpy as np
cell_data = np.load("C:/Users/alexs/Documents/DataMining/cell-data.npz")
cell_data.files

This gives the following output:

['images', 'counts', 'folds', 'compressed', 'allow_pickle']

As well as the image attached.

I am told the dataset has 3 folds. counts is an Nx6 matrix, with each row corresponding to a single image patch and each column to one of the 6 cell types (called T1, T2, …, T6). folds seems to be a 1xN matrix, although I am not sure; its values range from 0 to 2.

How would I find the number of instances per fold? And, if possible, how would I find which instances belong to which fold (or group the instances into separate arrays, one per fold, like fold1 = x, fold2 = x_2, etc.), so that I can then plot a histogram for each fold, with the counts of each cell type plotted separately (6 plots in total)?


Solution

  • OK, since you are new to programming, I will explain how indexing works in NumPy, which is an almost-universal mathematical library in Python.

    Say we have a variable folds which is defined as:

    import numpy as np
    folds = np.array([1,1,2,2,1,2,1,0,0,0,1,2,1,2,0,0,2,1])
    

    We can easily count the occurrences of each fold with a list comprehension:

    num_folds = 3
    # int(...) converts NumPy integers to plain Python ints
    fold_counts = [int(np.sum(folds == I)) for I in range(num_folds)]
    # fold_counts is now [5, 7, 6]
    

    This works because folds == I compares each element of folds to the fold number I (0, 1, or 2), giving True where the element equals I and False otherwise. Summing booleans treats True as 1 and False as 0, so the sum is the number of elements in that fold.
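
The same counts can also be read off in a single call. The sketch below is only an illustration: it reuses the toy folds array from above rather than the real data, assumes the fold labels are small non-negative integers, and flattens the array first in case it really is stored as 1xN, as the question suspects.

```python
import numpy as np

# Toy labels from the example above, stored as a 1xN array on purpose.
# With the real file this would be cell_data["folds"].
folds = np.array([[1, 1, 2, 2, 1, 2, 1, 0, 0, 0, 1, 2, 1, 2, 0, 0, 2, 1]])

# Flatten so that a 1xN (or Nx1) array becomes a plain length-N vector.
folds = np.asarray(folds).ravel().astype(int)

# bincount returns how often each integer 0, 1, 2, ... occurs.
fold_counts = np.bincount(folds, minlength=3)

# unique gives the same information without assuming the labels start at 0.
labels, counts = np.unique(folds, return_counts=True)

print(fold_counts.tolist())
print(dict(zip(labels.tolist(), counts.tolist())))
```

np.bincount is convenient when the labels are exactly 0, 1, 2; np.unique is the safer choice if the labels turn out to be something else.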

    To answer your other question, we can use a similar bit of code to separate the images into their folds:

    # if folds is stored as a 1xN array, flatten it first: folds = folds.ravel()

    # assuming images are in a list:
    image_folds = [[images[J] for J in np.where(folds == I)[0]] for I in range(num_folds)]

    # assuming images are in an array of shape [num_images, width, height, channels]:
    image_folds = [images[folds == I] for I in range(num_folds)]
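
The histogram part of the question can be handled with the same boolean indexing, applied to counts instead of images. What follows is a minimal sketch under one reading of the question (one figure per fold, six subplots, one per cell type). The helper names split_by_fold and plot_fold_histograms are made up for this example, the data is synthetic stand-in data rather than the real dataset, and matplotlib is assumed to be available for the plotting step.

```python
import numpy as np

CELL_TYPES = ["T1", "T2", "T3", "T4", "T5", "T6"]


def split_by_fold(counts, folds, num_folds=3):
    """Return a list whose i-th entry holds the rows of counts in fold i."""
    folds = np.asarray(folds).ravel()  # tolerate a 1xN fold array
    return [counts[folds == i] for i in range(num_folds)]


def plot_fold_histograms(counts_per_fold, bins=20):
    """Draw one figure per fold with six subplots, one per cell type."""
    import matplotlib.pyplot as plt  # imported here: only needed for plotting

    figures = []
    for fold_idx, fold_counts in enumerate(counts_per_fold):
        fig, axes = plt.subplots(2, 3, figsize=(12, 6))
        # fold_counts.T iterates over the six columns (one per cell type).
        for ax, name, column in zip(axes.ravel(), CELL_TYPES, fold_counts.T):
            ax.hist(column, bins=bins)
            ax.set_title(name)
            ax.set_xlabel("cells per patch")
            ax.set_ylabel("number of patches")
        fig.suptitle(f"Fold {fold_idx}")
        fig.tight_layout()
        figures.append(fig)
    return figures


# Synthetic stand-in data, made up for illustration only.
rng = np.random.default_rng(0)
folds = rng.integers(0, 3, size=(1, 90))  # 1xN fold labels in {0, 1, 2}
counts = rng.poisson(5.0, size=(90, 6))   # Nx6 cell counts

counts_per_fold = split_by_fold(counts, folds)
for i, fold_counts in enumerate(counts_per_fold):
    print(f"fold {i}: {fold_counts.shape[0]} patches")

# To display the plots (requires matplotlib):
# figures = plot_fold_histograms(counts_per_fold)
# import matplotlib.pyplot as plt; plt.show()
```

With the real file, counts and folds would be cell_data["counts"] and cell_data["folds"]. If the goal is instead to compare folds within a single cell type, the loops can be swapped so that each subplot overlays the three folds.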