How to split train and test data from a .mat file in sklearn?

I have an MNIST dataset stored as a .mat file, and I want to split it into train and test data with sklearn. sklearn reads the .mat file as below:

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sat Oct  8 18:13:47 2016',
 '__version__': '1.0',
 '__globals__': [],
 'train_fea1': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'train_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8),
 'test_fea1': array([[ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        ...,
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ..., 64,  0,  0],
        [ 0,  0,  0, ..., 25,  0,  0]], dtype=uint8),
 'test_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8)}

How can I do that?

Solution

I am guessing you meant that you loaded the .mat file into Python using scipy, not sklearn. A .mat file can be loaded like so:

import scipy.io
scipy.io.loadmat('your_dot_mat_file')

scipy returns the file's contents as a Python dictionary keyed by MATLAB variable name. In your case the data is already split: train_fea1 holds the training features with labels in train_gnd1, and test_fea1 holds the test features with labels in test_gnd1.
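If you are unsure which variables a file contains, a quick look at the dictionary keys and array shapes helps. The sketch below is only an illustration: it writes a tiny stand-in file (the name toy_mnist.mat and the array sizes are made up) so that it runs on its own, then lists every non-metadata entry.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Stand-in file so the sketch is self-contained; name and sizes are made up.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_mnist.mat")
sio.savemat(tmp_path, {
    "train_fea1": np.zeros((6, 784), dtype=np.uint8),
    "train_gnd1": np.arange(1, 7, dtype=np.uint8).reshape(-1, 1),
    "test_fea1": np.zeros((4, 784), dtype=np.uint8),
    "test_gnd1": np.arange(1, 5, dtype=np.uint8).reshape(-1, 1),
})

data = sio.loadmat(tmp_path)

# Keys wrapped in double underscores (__header__, __version__, __globals__)
# are file metadata, not MATLAB variables, so skip them.
shapes = {name: value.shape
          for name, value in data.items()
          if not name.startswith("__")}

for name, shape in shapes.items():
    print(name, shape)
```

With your real file, replace tmp_path with the path to your .mat file and drop the savemat call.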

To access your data, you can:

import scipy.io as sio
data = sio.loadmat('filename.mat')

train = data['train_fea1']
trainlabel = data['train_gnd1']

test = data['test_fea1']
testlabel = data['test_gnd1']
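
Before using these arrays it may be worth a quick sanity check. By default loadmat returns arrays with at least two dimensions, so the labels arrive as an (n, 1) column, as in the printed output in the question. The sketch below uses small made-up stand-ins for the real arrays; it checks that features and labels have the same number of rows and counts the samples per class.

```python
import numpy as np

# Made-up stand-ins shaped like the arrays loadmat returns.
train = np.zeros((6, 784), dtype=np.uint8)
trainlabel = np.array([[1], [1], [2], [2], [10], [10]], dtype=np.uint8)

# Features and labels must describe the same number of samples.
assert train.shape[0] == trainlabel.shape[0]

# Flatten the (n, 1) label column into a 1-D array of length n.
labels_1d = trainlabel.ravel()

# Count how many samples each class has.
classes, counts = np.unique(labels_1d, return_counts=True)
print(labels_1d.shape)
print(dict(zip(classes.tolist(), counts.tolist())))
```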

If, however, you want to re-split the data with sklearn's train_test_split, first combine the features and labels from both sets, then split randomly like so (after loading the data as above):

import numpy as np
from sklearn.model_selection import train_test_split

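X = np.vstack((train, test))
# labels load as (n, 1) columns; ravel() gives the 1-D array sklearn estimators expect
y = np.vstack((trainlabel, testlabel)).ravel()

# random_state fixes the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)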
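
One optional refinement, not required for the split above: passing stratify=y to train_test_split keeps the class proportions the same in both parts, which can matter for a 10-class problem like this one. A minimal sketch on made-up data (200 samples, labels 1 to 10, 20 per class), assuming 1-D labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 classes labelled 1..10, 20 samples each.
X = np.zeros((200, 784), dtype=np.uint8)
y = np.repeat(np.arange(1, 11), 20)

# stratify=y asks for the same class proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

# Per-class counts in the test part (index 0 is unused because labels start at 1).
test_counts = np.bincount(y_test)[1:]
print(test_counts)
```

Checking the shapes and per-class counts after the split is a cheap way to confirm that it did what you intended.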