
How to split train and test data from a .mat file in sklearn?


I have an MNIST dataset stored as a .mat file and want to split it into train and test data with sklearn. sklearn reads the .mat file as follows:

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sat Oct  8 18:13:47 2016',
 '__version__': '1.0',
 '__globals__': [],
 'train_fea1': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'train_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8),
 'test_fea1': array([[ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        ...,
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ..., 64,  0,  0],
        [ 0,  0,  0, ..., 25,  0,  0]], dtype=uint8),
 'test_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8)}

How can I do that?


Solution

  • I am guessing you meant that you loaded the .mat file into Python using scipy, not sklearn. A .mat file can be loaded like so:

    import scipy.io
    scipy.io.loadmat('your_dot_mat_file')
    

    scipy.io.loadmat returns the file's contents as a Python dictionary. In your case, the data is already split into a training set (features train_fea1, labels train_gnd1) and a test set (features test_fea1, labels test_gnd1).

    To access the arrays, index the dictionary by key:

    import scipy.io as sio
    data = sio.loadmat('filename.mat')
    
    train = data['train_fea1']
    trainlabel = data['train_gnd1']
    
    test = data['test_fea1']
    testlabel = data['test_gnd1']
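
    To check what was actually loaded, it can help to filter out the metadata keys and print each array's shape. The following is a minimal sketch, not part of the original answer: it writes a tiny stand-in file with scipy.io.savemat so that it runs on its own. The file name toy_mnist.mat and the array contents are made up; only the key names come from the question.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Build a tiny stand-in file so the sketch is self-contained. The key names
# mirror the question; the file name and array contents are assumptions.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_mnist.mat")
sio.savemat(tmp_path, {
    "train_fea1": np.zeros((6, 784), dtype=np.uint8),
    "train_gnd1": np.arange(1, 7, dtype=np.uint8).reshape(-1, 1),
})

data = sio.loadmat(tmp_path)

# Keys wrapped in double underscores (__header__, __version__, __globals__)
# are file metadata, not MATLAB variables, so skip them.
arrays = {k: v for k, v in data.items() if not k.startswith("__")}
shapes = {k: v.shape for k, v in arrays.items()}
print(shapes)

# Labels come back as an (n, 1) column; ravel() gives the 1-D form that
# most sklearn estimators expect.
labels = arrays["train_gnd1"].ravel()
print(labels.shape)
```

    With the real file, replace tmp_path with the path to your own .mat file and drop the savemat call.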
    

    If, however, you want to split your data using sklearn's train_test_split, you can first stack the existing train and test portions (features with features, labels with labels) into single arrays, then split them randomly like so (after loading the data as above):

    import numpy as np
    from sklearn.model_selection import train_test_split
    
    # Stack the predefined train and test portions into one dataset
    X = np.vstack((train, test))
    y = np.vstack((trainlabel, testlabel))
    
    # random_state fixes the seed so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
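
    One possible refinement, offered as a sketch rather than as part of the original answer: flatten the (n, 1) label column to 1-D and pass it to the stratify parameter of train_test_split, so that each class keeps the same proportion in both parts. The arrays below are synthetic stand-ins with assumed shapes (100 samples, 784 features, labels 1 to 10), not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the stacked arrays (shapes and values are
# assumptions): 100 samples of 784 features, labels 1..10 as an (n, 1) column.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784), dtype=np.uint8)
y = np.repeat(np.arange(1, 11, dtype=np.uint8), 10).reshape(-1, 1)

# Flatten the label column to 1-D, then split so that every class keeps
# its share of samples in both the training and the test part.
y_flat = y.ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y_flat, test_size=0.2, random_state=42, stratify=y_flat
)

print(X_train.shape, X_test.shape)
```

    Whether stratification matters depends on how balanced the classes are; for a roughly balanced dataset a plain random split is often close enough.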