Tags: python, pandas, dataframe, indexing, nan

Select non-null rows from a specific column in a DataFrame and take a sub-selection of other columns


I have a dataframe with several columns, and I selected some of them to create a variable like this:

xtrain = df[['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

I want to drop from this selection all rows where the Survive column in the main dataframe is NaN.


Solution

  • You can build a boolean mask from notnull() on the 'Survive' column, pass it to df.loc, and select the columns of interest in the same call:

    In [2]:
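    # make some data
    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randn(5, 7), columns=['Survive', 'Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'])
    # set one 'Survive' value to NaN; df.loc avoids chained assignment
    df.loc[2, 'Survive'] = np.nan
    df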
    Out[2]:
        Survive       Age      Fare  Group_Size      deck    Pclass     Title
    0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
    1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
    2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
    3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
    4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
    

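    Now pass the mask to loc as the row selector, with the list of wanted columns as the column selector, to keep only the rows where 'Survive' is not NaN: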

    In [3]:
    xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
    xtrain
    
    Out[3]:
            Age      Fare  Group_Size      deck    Pclass     Title
    0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
    1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
    3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
    4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482
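
As a rough, self-contained sketch of the same idea, the snippet below compares the loc-with-mask approach against dropna(subset=...) and also pulls out the matching 'Survive' values. The toy data, the seed, and the names cols, mask, xtrain_alt and ytrain are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd

cols = ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']

# Hypothetical toy data standing in for the real dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 7)), columns=['Survive'] + cols)
df.loc[2, 'Survive'] = np.nan

# Option 1: boolean mask passed to loc, as in the answer above
mask = df['Survive'].notnull()
xtrain = df.loc[mask, cols]

# Option 2: drop rows where 'Survive' is NaN, then select the columns
xtrain_alt = df.dropna(subset=['Survive'])[cols]

# Target values for the same rows, sharing the index of xtrain
ytrain = df.loc[mask, 'Survive']

print(xtrain.equals(xtrain_alt))  # True
print(list(xtrain.index))         # [0, 1, 3, 4]
```

Both options should give the same result here. The mask version has the small advantage that the same mask can be reused to select the target column, so the features and the target stay aligned.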