python algorithm scikit-learn bioinformatics scikits

How to format protein energy data from a text file into a MATLAB .mat file for scikit-feature algorithms

I need to test some algorithms from scikit-feature, and I want to use some datasets that are stored in text files, for example: link

I only know that the .mat files the algorithms take as input are formatted like this: the class labels are in a 'Y' array and the data in an 'X' array. Here is some code showing how they open the .mat files and read the data:

Here is the algorithm code

# test_CFS.py
import scipy.io

mat = scipy.io.loadmat('../data/colon.mat')
X = mat['X']    # data
X = X.astype(float)
y = mat['Y']    # label
y = y[:, 0]
n_samples, n_features = X.shape

I wrote a script to generate a .mat file from my data in .txt, and the result was successfully processed by the algorithm I used (test_CFS.py). It didn't show any error with the test file I used, which had just 9 columns and 8 rows.

Here is my code to make a .mat file from a .txt file:

#textToMat.py

import numpy as np
import scipy.io as sio

# Read every line of the text file; the context manager closes it afterwards
with open("matrix.txt", "r") as file:
    data = file.readlines()

Y = []
subY = []

X = []
subX = []

print(len(data))             # number of rows
print(len(data[0].split()))  # number of columns in the first row

for i in range(len(data)):
    values = data[i].split()

    subY.append(np.array(float(values[0]),dtype=float))
    Y.append(np.array(subY))
    subY = []

    for j in range(1, len(values)):
        subX.append(np.array(float(values[j]), dtype=float))

    X.append(subX)
    subX = []

npY = np.array(Y, dtype=float)
npX = np.array(X, dtype=float)

sio.savemat('matrix.mat', {'Y':npY,'X':npX})

But then, when I tried to run the algorithm with the big .mat file I generated, it returned this error:

Traceback (most recent call last):
  File "test_CFS.py", line 47, in <module>
    main()
  File "test_CFS.py", line 12, in main
    X = X.astype(float)
ValueError: setting an array element with a sequence.

You may ask why I append a one-element array to another array. That's because when I print the data from scikit-feature's .mat file, it returns this:

{'Y': array([[-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [ 1],
       [-1],
       [-1],
       [ 1],
       [ 1],
       [-1],
       [-1],
       [-1],
       [-1],
       [ 1],
       [-1],
       [ 1],
       [ 1],
       [-1],
       [-1],
       [ 1],
       [ 1],
       [-1],
       [-1],
       [-1],
       [-1],
       [ 1],
       [-1],
       [ 1]], dtype=int16), 'X': array([[ 2,  0,  0, ...,  0,  2, -2],
       [ 2,  2,  0, ...,  2,  0, -2],
       [-2,  2,  2, ..., -2, -2, -2],
       ..., 
       [ 0, -2, -2, ...,  0,  2, -2],
       [ 0,  0, -2, ...,  0, -2, -2],
       [ 0, -2, -2, ...,  0,  0,  0]], dtype=int16), '__version__': '1.0', '__header__': 'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Wed Mar 25 15:17:35 2015', '__globals__': []}

In my case I'm using float values.

Solution

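The problem is in your data, not in your code. For the NumPy conversion, all rows need to have the same length. Every row in the file you provided has 643 entries except row 232, which has 644. With one longer row, NumPy cannot build a regular 2-D float array, which is what the "setting an array element with a sequence" error is complaining about. Remove that row (or fix it so it also has 643 entries) and your code should work fine.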
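If you want to find such rows without counting by hand, a check along these lines may help. It is only a minimal sketch: the helper names `find_ragged_rows` and `check_file` are made up for illustration, and it assumes the file is whitespace-separated with one sample per line, as in your script.

```python
from collections import Counter


def find_ragged_rows(lines):
    """Return (expected_width, bad_rows) for a list of text lines.

    expected_width is the most common number of whitespace-separated
    values per line; bad_rows lists (line_number, width) for every
    non-blank line whose width differs. Line numbers are 1-based.
    """
    widths = [(num, len(line.split()))
              for num, line in enumerate(lines, start=1)
              if line.strip()]  # skip blank lines, keep original numbering
    if not widths:
        return 0, []
    expected = Counter(width for _, width in widths).most_common(1)[0][0]
    bad_rows = [(num, width) for num, width in widths if width != expected]
    return expected, bad_rows


def check_file(path):
    """Hypothetical convenience wrapper: run the check on a file on disk."""
    with open(path) as handle:
        return find_ragged_rows(handle.readlines())
```

Calling `check_file("matrix.txt")` before building the arrays would return the usual column count together with the offending line numbers, so you can repair or drop those rows before saving the .mat file.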