python pandas logistic-regression sklearn-pandas

Divide dataframe into two sets according to a column


I have a DataFrame df. I selected some of its columns and want to divide them into xtrain and xtest according to a column called Service, so that rows with 1 and 0 go into xtrain and rows with NaN go into xtest.

Service
1
0
0
1
NaN
NaN

xtrain = df.loc[df['Service'].notnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]

EDITED

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    ytrain = df['Service'].dropna()
    xtest = df.loc[df['Service'].isnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
    logistic = LogisticRegression()
    logistic.fit(xtrain, ytrain)
    logistic.predict(xtest)

I get this error for logistic.predict(xtest):

X has 220 features per sample; expecting 307

Solution

  • I think you need isnull:

    xtest = df.loc[df['Service'].isnull(), ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
    

    Another solution is to invert the boolean mask with ~:

    mask = df['Service'].notnull()
    xtrain = df.loc[mask, ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
    xtest = df.loc[~mask, ['Age', 'Fare', 'GSize', 'Deck', 'Class', 'Profession_title']]
    

    EDIT:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Sample data: Service is known for the first two rows and missing for the last two
    df = pd.DataFrame({'Service':[1,0,np.nan,np.nan],
                       'Age':[4,5,6,5],
                       'Fare':[7,8,9,5],
                       'GSize':[1,3,5,7],
                       'Deck':[5,3,6,2],
                       'Class':[7,4,3,0],
                       'Profession_title':[6,7,4,6]})
    
    print(df)
       Age  Class  Deck  Fare  GSize  Profession_title  Service
    0    4      7     5     7      1                 6      1.0
    1    5      4     3     8      3                 7      0.0
    2    6      3     6     9      5                 4      NaN
    3    5      0     2     5      7                 6      NaN
    
    # Rows with a known Service value are used for training
    ytrain = df['Service'].dropna()
    xtrain = df.loc[df['Service'].notnull(), ['Age','Fare','GSize','Deck','Class','Profession_title']]
    # Rows with a missing Service value are the ones to predict
    xtest = df.loc[df['Service'].isnull(), ['Age','Fare','GSize','Deck','Class','Profession_title']]
    logistic = LogisticRegression()
    logistic.fit(xtrain, ytrain)
    print(logistic.predict(xtest))
    [ 0.  0.]
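
The error in the question (220 features where 307 were expected) means that the training and test matrices ended up with different numbers of columns. The question does not show how that happened. One possible cause, assumed here purely for illustration, is that a categorical column such as Deck was one-hot encoded separately for each subset with pd.get_dummies. Below is a minimal sketch of that situation on made-up data, with one way to align the columns using DataFrame.reindex:

```python
import pandas as pd

# Hypothetical data: here 'Deck' holds string categories (an assumption,
# not something shown in the question).
train = pd.DataFrame({'Age': [4, 5, 6], 'Deck': ['A', 'B', 'C']})
test = pd.DataFrame({'Age': [7, 8], 'Deck': ['A', 'D']})

# Encoding each subset separately yields different dummy columns.
xtrain = pd.get_dummies(train, columns=['Deck'])
xtest_raw = pd.get_dummies(test, columns=['Deck'])
print(xtrain.shape[1], xtest_raw.shape[1])

# Align the test columns to the training columns: categories missing from
# the test set are filled with 0, categories unseen in training are dropped.
xtest = xtest_raw.reindex(columns=xtrain.columns, fill_value=0)
print(list(xtest.columns) == list(xtrain.columns))
```

If this is the cause, another option is to encode the whole DataFrame once and split on Service afterwards, so both subsets share the same columns from the start.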