Tags: python, scikit-learn, classification

sklearn binary classifier for dataset with datetime, categorical values without preprocessing?


I need to predict whether a driver who signed up will actually start driving, using a basic classifier.

city_name  signup_os  signup_channel  signup_date  bgc_date  first_completed_date  did_drive
Strark     ios web    Paid            1/2/16       NaN       NaN                   no
Strark     windows    Paid            1/21/16      NaN       NaN                   no

The dataset has some date columns. Which sklearn classifier should I use to train a basic model?

Fitting fails on the datetime values. All the features are either categorical or dates.

from sklearn.model_selection import train_test_split

X = refined_df[['city_name', 'signup_os', 'signup_channel', 'signup_date', 'bgc_date']]
y = refined_df['did_drive']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()

# Support Vector Machines
from sklearn.svm import LinearSVC
models['Support Vector Machines'] = LinearSVC()

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()



from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for key, model in models.items():
    # Fit the classifier
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Calculate metrics (scikit-learn expects y_true first, then y_pred).
    # pos_label='yes' assumes did_drive holds the strings 'yes' / 'no'.
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions, pos_label='yes')
    recall[key] = recall_score(y_test, predictions, pos_label='yes')

This raises:

ValueError: could not convert string to float: 'Berton'

It can't convert the city name to a float. How do I handle this?

Is there a decision tree that accepts datetime values without any additional conversion?


Solution

  • You can apply one-hot encoding to convert the categorical features into numerical ones. Scikit-learn provides OneHotEncoder for this:

    from sklearn.preprocessing import OneHotEncoder
    
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
    encoder = OneHotEncoder(sparse_output=False)
    X_categorical = encoder.fit_transform(X[['city_name', 'signup_os', 'signup_channel']])
    

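    For the dates, you can either extract components such as the year and month, or convert each date to a Unix timestamp. Either way, the column has to hold real datetimes rather than strings, because the .dt accessor only works on datetime columns.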

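    # Parse the date strings first (pandas imported as pd; format matches e.g. '1/21/16')
    X['signup_date'] = pd.to_datetime(X['signup_date'], format='%m/%d/%y')
    X['signup_year'] = X['signup_date'].dt.year
    X['signup_month'] = X['signup_date'].dt.month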
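
The Unix timestamp option mentioned above could look roughly like this. It is only a sketch: the small `df` and the new `signup_ts` / `bgc_ts` column names are made up for illustration, and the date format is assumed to be month/day/two-digit year as in the question's sample rows.

```python
import pandas as pd

# Hypothetical sample mirroring the question's date columns (values are made up)
df = pd.DataFrame({
    "signup_date": ["1/2/16", "1/21/16"],
    "bgc_date": [None, "1/25/16"],
})

epoch = pd.Timestamp("1970-01-01")

for col, new_col in [("signup_date", "signup_ts"), ("bgc_date", "bgc_ts")]:
    parsed = pd.to_datetime(df[col], format="%m/%d/%y")
    # Seconds since the Unix epoch; missing dates (NaT) come out as NaN
    df[new_col] = (parsed - epoch).dt.total_seconds()

print(df[["signup_ts", "bgc_ts"]])
```

Missing dates end up as NaN here, so they would still need to be imputed or dropped before fitting an estimator that does not accept missing values.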
    

    Finally, combine the encoded columns and the date-derived columns into the final input, then split it with train_test_split as before.

    import numpy as np

    X = np.concatenate((X_categorical, X[['signup_year', 'signup_month']]), axis=1)
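
For reference, the steps above can also be wired together with scikit-learn's ColumnTransformer and Pipeline, so that the encoder is fitted on the training split only. This is a sketch under assumptions, not the only way to do it: the rows in `refined_df` below are invented to make the example runnable, the date format is assumed to be month/day/two-digit year, and a decision tree stands in for whichever model from the question you want to try.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows shaped like the question's data (all values are made up)
refined_df = pd.DataFrame({
    "city_name": ["Strark", "Strark", "Berton", "Berton",
                  "Strark", "Berton", "Strark", "Berton"],
    "signup_os": ["ios web", "windows", "android web", "ios web",
                  "mac", "windows", "android web", "mac"],
    "signup_channel": ["Paid", "Paid", "Organic", "Referral",
                       "Organic", "Paid", "Referral", "Organic"],
    "signup_date": ["1/2/16", "1/21/16", "1/11/16", "1/29/16",
                    "1/10/16", "1/18/16", "1/16/16", "1/5/16"],
    "did_drive": ["no", "no", "yes", "yes", "no", "yes", "no", "yes"],
})

categorical = ["city_name", "signup_os", "signup_channel"]

# Turn the date strings into numeric features before splitting
dates = pd.to_datetime(refined_df["signup_date"], format="%m/%d/%y")
X = refined_df[categorical].copy()
X["signup_year"] = dates.dt.year
X["signup_month"] = dates.dt.month
y = refined_df["did_drive"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen in training as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",  # the numeric date features pass through unchanged
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=0)),
])

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions)
```

Because the preprocessing lives inside the pipeline, any of the other classifiers from the question can be swapped in for the "model" step without touching the encoding code.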