python pandas dataframe xgboost columnsorting

Dataframe of different size but no difference in columns


I am building an XGBoost model. I did my train-test split on a dataframe with 91 columns. I want to use the model on a new dataframe whose columns differ from those of my training set. I have removed the extra columns and added the ones that were present in the training dataset but not in the new one.

However, I cannot use the model, because the new set does not have the same number of columns as the training set. Yet when I compute the list of column differences, the list is empty.

Do you have any idea how I could fix this problem?

Thanks in advance for your time!


Solution

  • A set difference only works in one direction, so check the other direction as well. You can try it like this:

    import pandas as pd
    
    X_PAU = pd.DataFrame({'test1': ['A', 'A'], 'test2': [0, 0]})
    print(len(X_PAU.columns))
    X = pd.DataFrame({'test1': ['A', 'A']})
    print(len(X.columns))
    
    # Your implementation: only finds columns of X that are missing from X_PAU
    print(set(X.columns) - set(X_PAU.columns))  # empty set here
    
    # Reverse direction: columns of X_PAU that are missing from X
    print(X_PAU.columns.difference(X.columns).tolist())  # names of the missing columns
    print(len(X_PAU.columns.difference(X.columns)))  # number of missing columns
    

    Output

    2
    1
    set()
    ['test2']
    1
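
If the counts still disagree after checking both directions, duplicated column names are another possible cause, since a set collapses duplicates. The following is only a sketch under assumptions: the two frames are hypothetical stand-ins for the training data (`X`) and the new data (`X_PAU`), and filling missing columns with `0` is an assumed default that may not suit every feature.

```python
import pandas as pd

# Hypothetical stand-ins: X plays the training frame, X_PAU the new frame.
X = pd.DataFrame({"test1": ["A", "A"], "test2": [0, 0]})
X_PAU = pd.DataFrame([["B", 1, 2]], columns=["test1", "test1", "extra"])

# Check the difference in both directions.
only_in_X = sorted(set(X.columns) - set(X_PAU.columns))
only_in_X_PAU = sorted(set(X_PAU.columns) - set(X.columns))

# Duplicated names are invisible to set(), but they change the column count.
dupes = X_PAU.columns[X_PAU.columns.duplicated()].tolist()

# Keep the first occurrence of each name, then force the training columns
# and their order. fill_value=0 is an assumption, not a general default.
X_PAU_aligned = (
    X_PAU.loc[:, ~X_PAU.columns.duplicated()]
    .reindex(columns=X.columns, fill_value=0)
)

print(only_in_X, only_in_X_PAU, dupes)
print(list(X_PAU_aligned.columns))
```

After `reindex`, the new frame has the same columns as the training frame, in the same order, which is usually what a fitted model expects at prediction time.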