
Error while fitting train and test sets, train_test_split method


I am trying to evaluate my model with train_test_split. I defined the following function, which builds the output column (top) of the table from the value passed in:

def top_sh(num):
    ###Get the top(num) in Shanghai data and arrange
    ####input and output variables accordingly
    #Add column to be output value, either zero or one

    #shanghai = shanghai_cp.copy()
    if 'top' in shanghai.columns:
        shanghai.drop(columns = shanghai.columns[-1],inplace = True) 

    shanghai['top'] = shanghai['world_rank'].apply(lambda x: 1 if x<= num else 0)
    out = print('*****************'+ '\n' + 'Output array: Top'+ str(num)+ '\n' + 'Disregarding in Analysis: World rank')
    #call = print(shanghai.head(15))

    return out

Then I defined the train/test split process as follows:

def train_test(df,size, seed):
    ###Split the data into test and train sets and test

    #Get input output of df
    if df == 'shanghai':
        column1 = shanghai.columns[1:7]
        Y = shanghai.values[: , -1].astype(int)
        y = np.ravel(Y)
        X = shanghai.values[: , 1:7]
    elif df == 'times':
        column1 = times.columns[1:10]
        Y = times.values[: , -1].astype(int)
        y = np.ravel(Y)
        X = times.values[: , 1:10]
    else:
        return print('Available Datasets: "shanghai" , "times"')

    #Split into train and test
    X_Train, X_Test, Y_Train, Y_Test = train_test_split(X,Y, test_size=size, random_state=seed)

    #Get the regression
    model= LogisticRegression(solver='liblinear')
    model.fit(X_Train,X_Test)

    #See how accurately it is with the split
    result=model.score(X_Test,Y_Test)

    print(f'Accuracy {result*100:5.3f}')

    return

I run the following code:

top_sh(50)
shanghai.head()
X.shape
Y
Y.shape
train_test('shanghai',0.3,7)

X.shape = (768, 8)
Y.shape = (768, )

I get the following error in the train_test function, specifically on the model.fit line:

> ValueError: bad input shape (150, 6)


Solution

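  • The problem is what you pass to fit. It expects the training features as the first argument and the training labels as the second. The shape (150, 6) in the error is the 2-D X_Test array, which has the six feature columns selected by 1:7 and arrives where a 1-D label array is expected. So this line is incorrect: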

    model.fit(X_Train,X_Test)
    

    Instead, pass the training labels, Y_Train:

    model.fit(X_Train,Y_Train)
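
    To see the difference end to end, here is a minimal sketch on synthetic data. The array sizes, the random labels, and the name fit_error are assumptions made for illustration; they are not the asker's Shanghai table. The row count is chosen only so that the test split comes out as (150, 6), like the shape in the error message.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ranking table (assumption): 500 rows, 6 features.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 6))
# Binary label loosely tied to the first feature, playing the role of 'top'.
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

model = LogisticRegression(solver="liblinear")

# Reproduce the mistake: a 2-D feature matrix passed where labels belong.
try:
    model.fit(X_Train, X_Test)
    fit_error = None
except ValueError as exc:
    fit_error = exc  # scikit-learn rejects the 2-D array given as y

# Correct call: training features paired with training labels.
model.fit(X_Train, Y_Train)
accuracy = model.score(X_Test, Y_Test)
print(f"Accuracy {accuracy * 100:5.3f}")
```

    The first fit call fails because its second argument is a feature matrix, not a label vector. The second call trains normally and can then be scored on the held-out X_Test and Y_Test.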