python node.js scikit-learn decision-tree

Use Python's sklearn module with a custom dataset

I've never used Python before, and I find myself in dire need of using the sklearn module in my Node.js project for machine learning purposes.

I have spent all day trying to understand the code examples for that module, and now that I more or less understand how they work, I don't know how to use my own data set.

Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will let me load my own data set. (There's a load_data function, but it seems to be for internal use by the three I mentioned, because I can't import it.)

How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?

In case it's important: all those built-in data sets have numeric features, whereas my data set has both numeric and string features to be used in the decision tree.

Thanks

Solution

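You can load your data from any source, in any way you like, and then pass it to sklearn models. The load_* functions are only conveniences for the bundled example data sets; the data does not have to come from a .csv file and can just as well be built programmatically.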

If you have a .csv file, pandas would be the best option.

import pandas as pd

mydataset = pd.read_csv("dataset.csv")

X = mydataset.values[:, 0:10] # let's assume that the first 10 columns are the features/variables
y = mydataset.values[:, 10] # let's assume that the 11th column (index 10) has the target values/classes
...
sklearn_model.fit(X,y)

Similarly, you can load .txt or .xls files.
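As a rough sketch of the .txt case, assuming the file is tab-separated: pd.read_csv accepts a sep argument, so the same function covers most delimited text files. The column names and values here are made up for illustration, and io.StringIO stands in for a real file path. For Excel files there is pd.read_excel, which needs an optional Excel reader package installed.

```python
import io
import pandas as pd

# Stand-in for a tab-separated .txt file (hypothetical contents);
# with a real file you would pass its path instead.
txt_file = io.StringIO(
    "height\tweight\tlabel\n"
    "1.70\t65\t0\n"
    "1.82\t80\t1\n"
    "1.65\t58\t0\n"
)

df = pd.read_csv(txt_file, sep="\t")

# Select features and target by column name rather than by position.
X = df[["height", "weight"]].values  # 2D array, one row per sample
y = df["label"].values               # 1D array, one target per sample
```

Selecting columns by name is a bit less error-prone than slicing by position, since it keeps working if the column order in the file changes.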

The important thing in order to use sklearn models is this:

X should always be a 2D array with shape [n_samples, n_variables].
y should be the target variable, with one value per sample (shape [n_samples]).
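Regarding programmatically provided data and string features: here is a minimal sketch, assuming the data arrives as a list of records. The field names (age, income, city, bought) are hypothetical. sklearn's decision trees expect numeric feature values, so the string feature is one-hot encoded with pd.get_dummies before fitting; this is one common approach, not the only one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical in-memory data: two numeric features, one string feature,
# and a string target.
rows = [
    {"age": 25, "income": 30000, "city": "Paris",  "bought": "no"},
    {"age": 47, "income": 82000, "city": "Berlin", "bought": "yes"},
    {"age": 35, "income": 54000, "city": "Paris",  "bought": "yes"},
    {"age": 52, "income": 41000, "city": "Madrid", "bought": "no"},
    {"age": 29, "income": 67000, "city": "Berlin", "bought": "yes"},
    {"age": 61, "income": 28000, "city": "Madrid", "bought": "no"},
]
df = pd.DataFrame(rows)

# One-hot encode the string column so every feature is numeric.
X = pd.get_dummies(df[["age", "income", "city"]], columns=["city"], dtype=int)
y = df["bought"]  # string class labels are accepted by classifiers

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# A new sample must be encoded into the same columns, in the same order;
# categories missing from the new data are filled with 0.
new = pd.DataFrame([{"age": 40, "income": 60000, "city": "Berlin"}])
new_X = pd.get_dummies(new, columns=["city"], dtype=int)
new_X = new_X.reindex(columns=X.columns, fill_value=0)
prediction = model.predict(new_X)
```

Only the features need to be numeric; the target can stay as strings for a classifier. For a larger project, sklearn's own OneHotEncoder inside a Pipeline does the same encoding and remembers the categories for you.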