Tags: python, pandas, numpy, dataset, training-data

Creating a dataset from 2d matrices


I have a series of 2d matrices like these two:

import numpy as np

matrix_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_2 = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])

Each matrix has a label, like:

labels = np.array([0, 1])

I want to build a dataset from these matrices so I can train an ML model later. First, I tried writing a small .csv file for each matrix, but we cannot train an ML model on multiple .csv files.

Then, I tried this code:

matrix_1_flat = matrix_1.flatten()
matrix_2_flat = matrix_2.flatten()

dataset = np.array([matrix_1_flat, matrix_2_flat])
dataset = np.transpose(dataset)

But I feel like the spatial information will be lost. Is there another function, apart from the ones I'm using, that creates what I want?

To be clear, by labels I mean the y variables in machine learning terms. In this example, matrix_1 and matrix_2 (two 2d matrices) are my x_train; the label of matrix_1 is 0 (or cat, if that makes it easier to understand) and the label of matrix_2 is 1 (or dog).

I want the training data and its labels to look like this:

x_train = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[10, 11, 12], [13, 14, 15], [16, 17, 18]]])
y_train = np.array(["cat", "dog"])

Solution

  • I guess you want a dataset in which each x-y pair (a matrix and a label) keeps x in its original shape, so that no spatial information is lost and each matrix is treated as image-like.

    With the aid of numpy, you can create a compressed file representing the dataset as follows:

    import numpy as np

    matrix_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    matrix_2 = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])
    
    # preparing "x" and "y" - the dataset
    matrices = [matrix_1, matrix_2]
    labels = np.array([0, 1])
    
    # save into an npz object: 
    #  - it's dict-like, so we use "x" and "y" as keys
    #  - this will be saved as "matrix_dataset.npz"
    np.savez_compressed('matrix_dataset', x=matrices, y=labels)
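
Passing a plain Python list as x works because NumPy converts it to an array when saving. As a minimal sketch, assuming all matrices share the same shape, you could stack them explicitly first so that a shape mismatch fails early (the name x here is only illustrative):

```python
import numpy as np

matrix_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_2 = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])

# np.stack adds a new leading (batch) axis; it raises ValueError if shapes differ
x = np.stack([matrix_1, matrix_2])
labels = np.array([0, 1])

print(x.shape)  # (2, 3, 3)

# sanity check: exactly one label per matrix
assert x.shape[0] == labels.shape[0]
```

The stacked array can then be passed to np.savez_compressed in place of the list.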
    

    The npz file can later be loaded back into memory:

    ds = np.load('matrix_dataset.npz')
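
The object returned by np.load for an npz file keeps the archive open until it is closed. A small sketch of one way to handle that, using a with block and the .files attribute to list the stored keys; the temporary directory is an assumption made only to keep the example self-contained:

```python
import os
import tempfile

import numpy as np

# same values as matrix_1 and matrix_2 in the example
x = np.arange(1, 19).reshape(2, 3, 3)
y = np.array([0, 1])

# hypothetical temporary location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "matrix_dataset.npz")
np.savez_compressed(path, x=x, y=y)

# the with block closes the archive once the arrays have been read
with np.load(path) as ds:
    keys = sorted(ds.files)
    x_train = ds["x"]
    y_train = ds["y"]

print(keys)           # ['x', 'y']
print(x_train.shape)  # (2, 3, 3)
```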
    

    You can access the "x" and "y" fields simply by their key:

    # e.g. if you want to train your model, after loading
    x_train = np.array(ds['x'])
    y_train = np.array(ds['y'])
    
    # your model fitting code...
    

    Note that the shape of x_train is now (N, 3, 3), where N (2 in this case) is the size of the batch axis, so x_train[0] retrieves the first 3x3 matrix.
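
If a model later expects one row per sample, flattening can be deferred to training time. Flattening only changes the layout, not the values, so the 3x3 arrangement can be recovered as long as the original shape is known. A minimal sketch (the names x_flat and x_restored are illustrative):

```python
import numpy as np

x_train = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    [[10, 11, 12], [13, 14, 15], [16, 17, 18]]])

# one row per sample: shape (N, 9)
x_flat = x_train.reshape(len(x_train), -1)

# recover the original (N, 3, 3) layout
x_restored = x_flat.reshape(-1, 3, 3)

print(x_flat.shape)                         # (2, 9)
print(np.array_equal(x_restored, x_train))  # True
```

Storing the data in its (N, 3, 3) form keeps both options open: use it as is, or reshape it on demand.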