How to split a datatable dataframe into train and test datasets in Python


I am using a datatable dataframe. How can I split it into train and test datasets?
As I would with a pandas dataframe, I tried train_test_split(dt_df,classes) from sklearn.model_selection, but it doesn't work and raises an error.

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split

dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>

I tried a workaround: converting the dataframe to a numpy array:

classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()

This works, but I don't know whether there is a way to make train_test_split work directly on a datatable dataframe, as it does on a pandas dataframe.

Edit 1: The column names in the csv file are strings, and the values are unsigned integers. print(dt_df) gives:

     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …  
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---     
   0 |   0    0    0    0    2    0    1    0    0    1  …  
   1 |   0    0    0    0    1    0    2    1    0    1  …  
   2 |   0    0    0    1    1    0    1    0    1    2  …  
   3 |   0    0    0    1    1    0    1    0    1    2  …  
   4 |   0    0    0    1    1    0    1    0    1    2  …  
   5 |   0    0    0    1    1    0    1    0    1    2  …  
   6 |   0    0    0    1    0    0    3    0    0    2  …  
   7 |   0    0    0    1    1    0    0    0    1    2  …  
   8 |   0    0    0    1    1    0    1    0    1    2  …  
   9 |   0    0    1    0    1    0    1    0    1    3  …  
  10 |   0    0    1    0    1    0    1    0    1    3  …  
      ...

Thanks for your help.


Solution

  • To split a datatable dataframe into train and test datasets with train_test_split(dt_df,classes) from sklearn.model_selection, I convert the datatable dataframe to numpy, as mentioned in my question, or to a pandas dataframe, as suggested in a comment by @Manoor Hassan (optionally converting back again afterwards):

    source code before split method:

    import datatable as dt
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import ExtraTreesClassifier
    
    dt_df = dt.fread(csv_file_path)
    
    classe = np.ravel(dt_df[:, "classe"])
    del dt_df[:, "classe"]
    

    source code after split method:

    ExTrCl = ExtraTreesClassifier()
    ExTrCl.fit(X_train, y_train)
    pred_test = ExTrCl.predict(X_test)
    

    method 1: convert to numpy

    # source code before split method
    
    dt_df = dt_df.to_numpy()
    
    X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
    
    # source code after split method
    

    method 2: convert to numpy, then convert back to a datatable dataframe after the split:

    # source code before split method
    
    dt_df = dt_df.to_numpy()
    
    X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
    
    X_train = dt.Frame(X_train)
    
    # source code after split method
    

    method 3: convert to pandas dataframe

    # source code before split method
    
    dt_df = dt_df.to_pandas()
    
    X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
    
    # source code after split method
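
    A possible variant, which is not one of the methods timed in this answer: split the row indices rather than the data, then select rows from the datatable dataframe. This is only a sketch. The helper names split_row_indices and take_rows are made up for illustration, and it assumes that datatable accepts a plain list of integers as a row selector.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_row_indices(n_rows, labels, test_size=0.25, random_state=None):
    """Split the row positions 0..n_rows-1 together with their labels.

    Only an index array goes through train_test_split, so the frame
    itself never has to be converted.
    """
    idx = np.arange(n_rows)
    train_idx, test_idx, y_train, y_test = train_test_split(
        idx, labels, test_size=test_size, random_state=random_state
    )
    return train_idx, test_idx, y_train, y_test


def take_rows(frame, idx):
    """Select rows by position (hypothetical helper).

    Assumption: the frame accepts a plain list of ints as row selector,
    so the numpy index array is converted to a list first.
    """
    return frame[idx.tolist(), :]


# Usage with the names from this answer (dt_df, classe, test_size):
# train_idx, test_idx, y_train, y_test = split_row_indices(
#     dt_df.nrows, classe, test_size=test_size)
# X_train = take_rows(dt_df, train_idx)
# X_test = take_rows(dt_df, test_idx)
```

    With this variant X_train and X_test stay datatable dataframes, so whether the classifier accepts them directly would still need to be checked.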
    

    All 3 methods work, but they differ in the time taken for training (ExTrCl.fit) and prediction (ExTrCl.predict). For a csv file of about 500 MB I got these results:

                           T convert    T.train     T.pred
    M1 to_numpy             3           85          0.5
    M2 to_numpy and back    3.5         29          0.5
    M3 to pandas            4           37          4
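
    Timings like these can be collected with a small wrapper around each call. The following is a minimal sketch and not necessarily how the numbers above were measured; the helper name timed is made up for illustration.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the names from this answer:
# _, t_train = timed(ExTrCl.fit, X_train, y_train)
# pred_test, t_pred = timed(ExTrCl.predict, X_test)
```

    time.perf_counter is a monotonic, high-resolution clock, which makes it better suited to measuring intervals than time.time.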
    
