I've never used python before and I find myself in the dire need of using sklearn module in my node.js project for machine learning purposes.
I have been all day trying to understand the code examples in said module and now that I kind of understand how they work, I don't know how to use my own data set.
Each of the built in data sets has its own function (load_iris
, load_wine
, load_breast_cancer
, etc) and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (there's a load_data
function but it seems to be for internal use of the previous three I mentioned, cause I can't import
it)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?
In case it's important: all those built-in data sets have numeric features, my data set has both numeric and string features to be used in the decision tree.
Thanks
You can load whatever you want and then use sklearn
models.
If you have a .csv
file, pandas
would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:,0:10] # let's assume that the first 10 columns are the features/variables
y = mydataset.values[:,11] # let's assume that the 11th column has the target values/classes
...
sklearn_model.fit(X,y)
Similarily, you can load .txt
or .xls
files.
The important thing in order to use sklearn models is this:
X
should be always be an 2D array with shape [n_samples, n_variables]y
should be the target varible.