Tags: statistics, regression, linear-regression, analytics, logistic-regression

How do I predict an outcome given many continuous and categorical variables?


I have a sample dataset in which I'm trying to find strong predictors of whether a student passes an exam (the target has the value 0 or 1). However, the dataset contains a mix of continuous and categorical variables (around 100 columns), such as mother's profession, city, is_male, is_female, etc. Can someone please guide me on which model and variables I should choose to build a model?

This is what the dataset looks like: sample image of data set


Solution

  • Remove columns that have 0 observations, as they are useless for modeling. Columns that hold a single value on all rows can also be removed. These are referred to as zero-variance predictors, because they show no variation the model could learn from.

    Use the nunique() function to count the number of unique values in each column:

    DataFrame.nunique(axis=0, dropna=True)
    

    Use drop() to drop useless columns (note axis=1 drops a column; axis=0 would drop a row):

    DataFrame.drop('label', axis=1, inplace=True)
    
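The two steps above can be combined into a short sketch. The column names here are hypothetical placeholders, not from the actual dataset:

```python
import pandas as pd

# Hypothetical training frame; column names are illustrative only.
train = pd.DataFrame({
    "mothersProfession": ["teacher", "doctor", "teacher", "nurse"],
    "constant_col": [1, 1, 1, 1],        # single value on every row
    "score": [55, 80, 67, 72],
    "Pass": [0, 1, 1, 1],
})

# Columns with at most one unique value are zero-variance predictors.
n_unique = train.nunique(axis=0, dropna=True)
zero_variance = n_unique[n_unique <= 1].index.tolist()
train.drop(columns=zero_variance, inplace=True)

print(zero_variance)        # ['constant_col']
print(list(train.columns))  # ['mothersProfession', 'score', 'Pass']
```

Passing a list to `drop(columns=...)` removes all zero-variance columns in one call.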

    Missing values in numeric attributes can be filled with the median. A column that takes only two kinds of values, such as null and YD in "Mentor_Orgs_Column", can be converted to a boolean.
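As a sketch of the median fill and the two-value-to-boolean conversion (the data here is made up; only the "Mentor_Orgs_Column" name comes from the question):

```python
import pandas as pd

# Illustrative frame: "Mentor_Orgs_Column" holds either null (NaN) or "YD".
df = pd.DataFrame({
    "score": [55.0, None, 67.0, 72.0],
    "Mentor_Orgs_Column": [None, "YD", None, "YD"],
})

# Fill numeric gaps with the column median (median of 55, 67, 72 is 67).
df["score"] = df["score"].fillna(df["score"].median())

# Collapse the two-valued column to a boolean: True where a value exists.
df["has_mentor_org"] = df["Mentor_Orgs_Column"].notna()

print(df["score"].tolist())           # [55.0, 67.0, 67.0, 72.0]
print(df["has_mentor_org"].tolist())  # [False, True, False, True]
```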

    Check the impact of each categorical and numeric attribute on the target attribute.

    For example:

    print(train[["mothersProfession","Pass"]].groupby(['mothersProfession'], as_index=False).mean())
    # Shows the impact of 'mothersProfession' on the pass rate in the training data.
    

    This will help you find attributes that are useful for prediction. You can then apply different scikit-learn classifiers to this data on a trial-and-error basis to get different insights.
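One way to sketch that last step is a scikit-learn pipeline that one-hot encodes the categorical columns and fits a logistic regression (a reasonable first choice for a 0/1 target). The toy data and column names below are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative dataset; in practice use your cleaned frame.
train = pd.DataFrame({
    "mothersProfession": ["teacher", "doctor", "nurse",
                          "teacher", "doctor", "nurse"],
    "score": [40, 85, 60, 45, 90, 70],
    "Pass": [0, 1, 0, 0, 1, 1],
})

X = train.drop(columns="Pass")
y = train["Pass"]

# One-hot encode the categorical column; pass numeric columns through.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["mothersProfession"])],
    remainder="passthrough",
)

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)

# Predict for a new student (hypothetical input).
print(model.predict(pd.DataFrame({"mothersProfession": ["doctor"],
                                  "score": [88]})))
```

Swapping `LogisticRegression` for another classifier (e.g. a tree-based model) in the same pipeline is the trial-and-error loop described above.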