Tags: python, arrays, numpy, machine-learning, xgboost

XGBoost: multi-dimensional array as input to a model


How would you train a model on a dataset where each row holds 4 matrices? Below is a minimal reproducible example: a dataset of 2 rows, each containing 4 matrices of shape 3 x 6.

import numpy as np
import xgboost as xgb
# from scipy.sparse import csr_matrix
x =   [np.array([[[ 985. ,  935. ,  396. ,   258.5,  268. ,  333. ],
         [ 968. , 1000. , 1048. ,   237.5,  308.5,  359.5],
         [ 350. ,  336. ,  422. ,   182.5,  264.5,  291.5]],
 
        [[ 867. ,  863. ,  512. ,   511. ,  485.5,  525. ],
         [ 917. ,  914. ,  739. ,   450. ,  524.5,  571. ],
         [ 663. ,  656. ,  768. ,   352.5,  460. ,  439. ]],
 
        [[ 569. ,  554. ,  269. ,   240. ,  240. ,  263.5],
         [ 597. ,  592. ,  560. ,   222. ,  244.5,  290. ],
         [ 390. ,  377. ,  457. ,   154.5,  289.5,  272. ]],
 
        [[2002. , 2305. , 3246. ,  3586.5, 3421.5, 3410. ],
         [2378. , 2374. , 1722. ,  3351.5, 3524. , 3456. ],
         [3590. , 3457. , 3984. ,  2620. , 2736.5, 2290. ]]]),
        
 np.array([[[ 412. ,  521. ,  642. ,   735. ,  847.5,  358.5],
         [ 471. ,  737. ,  558. ,   331.5,  324. ,  317.5],
         [ 985. ,  935. ,  396. ,   258.5,  268. ,  333. ]],
 
        [[ 603. ,  674. ,  786. ,   966. , 1048. ,  605.5],
         [ 657. ,  810. ,  789. ,   582. ,  573. ,  569.5],
         [ 867. ,  863. ,  512. ,   511. ,  485.5,  525. ]],
 
        [[ 325. ,  426. ,  544. ,   730.5,  804.5,  366.5],
         [ 396. ,  543. ,  486. ,   339.5,  334. ,  331. ],
         [ 569. ,  554. ,  269. ,   240. ,  240. ,  263.5]],
 
        [[3133. , 3808. , 3617. ,  4194.5, 4098. , 3802. ],
         [3479. , 3488. , 3854. ,  3860. , 3778.5, 3643. ],
         [2002. , 2305. , 3246. ,  3586.5, 3421.5, 3410. ]]])]
    
y = [np.array(6), np.array(10)]

The attempt below converts the data into a DMatrix, which raises an error. I have also tried other approaches, such as wrapping the data in a csr_matrix.

One option could be to go from (2 rows, 4 matrices, 3 x 6 each) to (2 rows, ~10 features) by applying dimensionality reduction to the matrices and reshaping the result. Is that the best approach?

# X = csr_matrix(x)
dtrain_xbg = xgb.DMatrix(x, label=y)

params = {'max_depth': 3, 'learning_rate': .05, 'min_child_weight' : 4, 'subsample' : 0.8}
model = xgb.train(dtrain=dtrain_xbg, params=params,num_boost_round=200)

Solution
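
XGBoost works on tabular input, one row per sample and one column per feature, so each sample's stack of matrices has to be flattened into a single row before it can go into a DMatrix. A minimal NumPy-only sketch of that shape change, using a toy stand-in for the question's data (the values and the names `X4d`, `X2d` and `y1d` are illustrative, not from the original post):

```python
import numpy as np

# Toy stand-in: 2 samples, each a stack of 4 matrices of shape (3, 6)
x = [
    np.arange(72, dtype=float).reshape(4, 3, 6),
    np.arange(72, 144, dtype=float).reshape(4, 3, 6),
]
y = [np.array(6), np.array(10)]

# Stacking the list gives a 4-D array: (samples, matrices, rows, cols)
X4d = np.stack(x)

# Flatten everything except the sample axis: (2, 4*3*6) = (2, 72)
X2d = X4d.reshape(len(x), -1)

# Labels become a plain 1-D vector with one entry per sample
y1d = np.asarray(y, dtype=float)

print(X4d.shape, X2d.shape, y1d.shape)
```

A 2-D array like `X2d`, with `y1d` as the label, is the layout that `xgb.DMatrix(X2d, label=y1d)` expects. Flattening discards the information about which values sat next to each other in a matrix, which is usually acceptable for tree models but worth keeping in mind.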

  • reshaping the array to a dataframe, one flattened row per sample

    import pandas as pd

    # Flatten each sample, e.g. shape (4, 3, 6), into a single row of 4*3*6 = 72 values
    data = pd.DataFrame([a.reshape(-1) for a in x])
    

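  • reducing the dataframe from 100K to 100 columns with PCA

    from sklearn.decomposition import PCA

    # data: the flattened dataframe from the previous step.
    # n_components cannot exceed min(n_samples, n_features), so 100 needs at least 100 rows.
    pca = PCA(n_components=100)
    principalComponents = pca.fit_transform(data.fillna(0))
    principalDf = pd.DataFrame(data=principalComponents)
    principalDf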