python, node.js, scikit-learn, decision-tree

Use Python's sklearn module with a custom dataset


I've never used Python before, and I find myself in dire need of using the sklearn module in my node.js project for machine learning purposes.

I have spent all day trying to understand the code examples for that module, and now that I more or less understand how they work, I don't know how to use my own data set.

Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will let me load my own data set. (There's a load_data function, but it seems to be for internal use by the three I mentioned, because I can't import it.)

How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?

In case it's important: all those built-in data sets have numeric features, whereas my data set has both numeric and string features to be used in the decision tree.

Thanks


Solution

  • You can load whatever you want and then use sklearn models.

    If you have a .csv file, pandas would be the best option.

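    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier  # example model; any sklearn estimator works

    mydataset = pd.read_csv("dataset.csv")

    X = mydataset.iloc[:, 0:10].values  # let's assume that the first 10 columns are the features/variables
    y = mydataset.iloc[:, 10].values    # let's assume that the 11th column (index 10) has the target values/classes
    ...
    sklearn_model = DecisionTreeClassifier()
    sklearn_model.fit(X, y)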
    

    Similarly, you can load .txt or .xls files.
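As a rough sketch of the other formats: a delimited .txt file can usually be read with `pd.read_csv` by passing the separator, and an Excel file with `pd.read_excel` (which needs an extra Excel engine package installed). The file contents and column names below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated data; an in-memory buffer stands in for a real
# file so the example is self-contained. With a file on disk you would pass
# its path instead, e.g. pd.read_csv("dataset.txt", sep="\t").
raw_text = "height\tweight\tlabel\n1.70\t65\t0\n1.82\t90\t1\n1.60\t50\t0\n"
txt_dataset = pd.read_csv(io.StringIO(raw_text), sep="\t")

# Excel files go through pd.read_excel (requires an engine such as openpyxl):
# xls_dataset = pd.read_excel("dataset.xlsx")

X_txt = txt_dataset[["height", "weight"]].values  # 2D array of features
y_txt = txt_dataset["label"].values               # 1D array of targets
print(X_txt.shape, y_txt.shape)
```

Once the data is in a DataFrame, the rest is the same as in the .csv case.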

    The important thing in order to use sklearn models is this:

    • X should always be a 2D array with shape [n_samples, n_variables]
    • y should be the target variable, a 1D array with shape [n_samples]
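To illustrate the two shapes with data provided programmatically rather than from a file, here is a minimal sketch, assuming a made-up dataset with one string feature. scikit-learn's decision trees expect numeric input, so the string column is encoded first; `OneHotEncoder` inside a `ColumnTransformer` is one common choice, not the only one. All column names and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical in-memory data: two numeric features and one string feature.
data = pd.DataFrame({
    "age": [25, 47, 35, 52, 23, 40],
    "income": [30000, 82000, 56000, 91000, 27000, 64000],
    "city": ["Paris", "London", "Paris", "Berlin", "Berlin", "London"],
    "bought": [0, 1, 0, 1, 0, 1],
})

X = data[["age", "income", "city"]]  # 2D: [n_samples, n_variables]
y = data["bought"]                   # 1D: [n_samples]

# One-hot encode the string column, pass the numeric columns through as-is.
encode_strings = ColumnTransformer(
    [("city", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)

# Chain the encoding step and the tree so both are applied on fit and predict.
model = make_pipeline(encode_strings, DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# A new sample must have the same feature columns as the training data.
new_sample = pd.DataFrame({"age": [30], "income": [60000], "city": ["Paris"]})
prediction = model.predict(new_sample)
print(X.shape, y.shape, prediction)
```

Wrapping the encoder and the tree in one pipeline means new samples can be passed in the same raw form as the training data, strings included.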