Tags: python, pandas, dataframe, dimension

Why does the sample size differ when concatenating two DataFrames?


I transformed the training set and the test set separately with pandas.get_dummies() to get dummy variables for the categorical features.
Because the two sets do not contain the same categories, the resulting DataFrames had different numbers of columns.
I tried to equalize the dimensions.
But the problem shown below occurred.
Why does the sample size (number of rows) change when the two DataFrames are concatenated?

[Screenshot from the original question; image not available]


Solution

  • X_train.index is probably not a default RangeIndex, and concat aligns rows by index label, so rows whose labels do not match are added as extra rows. Reset the index before concat:

    X_train = X_train.reset_index(drop=True)
    

    Another solution is to pass the index parameter so that both DataFrames share the same index:

    import numpy as np
    import pandas as pd

    diff_df2 = pd.DataFrame(np.zeros((X_train.shape[0], len(diff_dummy2))),
                            columns=diff_dummy2,
                            index=X_train.index)
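
The effect described above can be reproduced with a small sketch. The data here is made up for illustration: the X_train values, its index labels (10, 20, 30), and the column names in diff_dummy2 are assumptions, not taken from the question. It shows how concat along axis=1 adds rows when the indexes differ, and how either fix keeps the original row count.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame whose index is not 0..n-1,
# as can happen after a shuffled train/test split.
X_train = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
diff_dummy2 = ["cat_x", "cat_y"]  # hypothetical missing dummy columns

# Problem: the zero-filled frame gets a default RangeIndex (0, 1, 2),
# so concat(axis=1) takes the union of both indexes and adds rows.
zeros = np.zeros((X_train.shape[0], len(diff_dummy2)))
bad = pd.DataFrame(zeros, columns=diff_dummy2)
mismatched = pd.concat([X_train, bad], axis=1)
print(mismatched.shape)  # (6, 3)

# Fix 1: give X_train a default RangeIndex before concatenating.
fixed1 = pd.concat([X_train.reset_index(drop=True), bad], axis=1)
print(fixed1.shape)  # (3, 3)

# Fix 2: build the zero-filled frame on X_train's own index.
good = pd.DataFrame(zeros, columns=diff_dummy2, index=X_train.index)
fixed2 = pd.concat([X_train, good], axis=1)
print(fixed2.shape)  # (3, 3)
```

The second fix keeps the original index labels, which may matter if the rows need to be matched back to y_train or to the source data later.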