Sklearn algorithm require a feature and a label for it to learn.
I have a CSV file which contain some data. These data is actually a challenge from hackerearth website in which participant need to create a learning algorithm that learn from data on massive amount of individuals from affiliate network and their ad click performance which then predict future performance of other individuals in the affiliate network which allow the company to optimize their ad performance.
The features in these data include id,date,siteid, offerid, category, merchant, countrycode,type of browser, type of device and the number of clicks their ads have gotten.
So my plan is to use the first 7 information as my feature and ad click as label. Unfortunately, countrycode,browser and device information is in text (Google Chrome, Desktop) and not integers which can be turned into array.
Q1: Is there a way for sklearn to accept not just numpy arrays but also words as features? Am I support to use vectorizer for this? If so, how would I do it? If not, can I just replace the wording data into numbers (Google Chrome replaced by 1, firefox replaced by 2) and still have it to work? (I am using Naive Bayes algorithm)
Q2: Would Naive Bayes algorithm be suitable for this task? Since this competition require participant to create a program that predict the probability of individuals in affiliate network have their ads click, I assume Naive Bayes would be best suited.
Training data : https://drive.google.com/open?id=1vWdzm0uadoro3WcpWmJ0SVEebeaSsHvr
Testing data : https://drive.google.com/open?id=1M8gR1ZSpNEyVi5W19y0d_qR6EGUeGBQl
My messy coding and horrible attempt at this challenge which I don't think will be much help:
from sklearn.naive_bayes import GaussianNB
import csv
import pandas as pd
import numpy as np
data = []
from numpy import genfromtxt
import pandas as pd
data = genfromtxt('smaller.csv', delimiter=',')
dat = pd.read_csv('smaller.csv', delimiter=',')
print(dat(siteid))
feature = []
label =[]
i = 1
j = 1
while i <17:
feature.append(data[i][2:8])
i += 1
while j <17:
label.append(data[i][9])
j += 1
clf = GaussianNB()
clf.fit(feature,label)
print(clf.predict([data[18][2:8]]))
print(data[18])
Answer for Question1: No. Sklearn only works with numerical data. So you need to convert your text to numbers.
Now to convert text to numbers you can follow multiple approaches. First is as you said just assign numbers to them. But you need to to take in account if the text data shows any order like the numbers assigned to them or not. In that case, most often one-hot encoding is used. Please see the below scikit-learn documentation for that: - http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
Answer to Question 2: It depends on the data and task at hand.
No single algorithm is capable of handling every type of data optimally.
Most of the times we need to compare multiple algorithms and see what gives best result for our data. See this example:
Even in a single algorithm we need to check for various parameter values, tune those values for maximum score. This is called grid-search. See this example:
Hope this clears your doubts. Make sure to go through the scikit-learn documentation and examples:
They are one of the best out there.