python, scikit-learn, data-analysis, data-science

Predict test data using model based on training data set?


I'm new to data science and analysis. After going through a lot of kernels on Kaggle, I built a model that predicts the price of a property. I've tested this model on my training data, but now I want to run it on my test data. I have a test.csv file that I want to use. How do I do that? This is what I did with my training dataset:

#loading my train dataset into python
train = pd.read_csv('/Users/sohaib/Downloads/test.csv')

#factors that will predict the price
train_pr = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']

#set my model to DecisionTree
model = DecisionTreeRegressor()

#set prediction data to factors that will predict, and set target to SalePrice
prdata = train[train_pr]
target = train.SalePrice

#fitting model with prediction data and telling it my target
model.fit(prdata, target)

model.predict(prdata.head())

What I tried next was to copy the whole code, replacing "train" with "test" and "prdata" with "testprdata". I thought that would work, but sadly it didn't. I know I'm doing something wrong, but I don't know what it is.


Solution

  • As long as you process the train and test data in exactly the same way, the predict function will work on either data set. So load both the train and test sets, fit on the train set, and predict on either the test set alone or on both.

    Also, note that the file you're reading is the test data. Assuming the file is named correctly, you are currently training on your test data, even though you named the variable train.

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    
    #loading my train and test datasets into python
    train = pd.read_csv('/Users/sohaib/Downloads/train.csv')
    test = pd.read_csv('/Users/sohaib/Downloads/test.csv')
    
    #factors that will predict the price
    desired_factors = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']
    
    #set my model to DecisionTree
    model = DecisionTreeRegressor()
    
    #set prediction data to factors that will predict, and set target to SalePrice
    train_data = train[desired_factors]
    test_data = test[desired_factors]
    target = train.SalePrice
    
    #fitting model with prediction data and telling it my target
    model.fit(train_data, target)
    
    model.predict(test_data.head())
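
If you want to try the same fit-on-train, predict-on-test pattern without the CSV files, here is a minimal sketch. The two small DataFrames and their values are invented stand-ins for train.csv and test.csv, only two of the factors are used to keep it short, and the PredictedSalePrice column name is just an illustrative choice.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-ins for train.csv and test.csv (values invented for illustration).
# Note that the test set has no SalePrice column: that is what we want to predict.
train = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8],
    "GrLivArea": [1000, 1500, 1800, 2400],
    "SalePrice": [120000, 160000, 210000, 300000],
})
test = pd.DataFrame({
    "OverallQual": [6, 8],
    "GrLivArea": [1400, 2300],
})

# Use the same feature columns, in the same order, for both data sets
desired_factors = ["OverallQual", "GrLivArea"]

# Fit only on the training data
model = DecisionTreeRegressor(random_state=0)
model.fit(train[desired_factors], train["SalePrice"])

# Predict on the test data; the result has one value per test row
test_predictions = model.predict(test[desired_factors])

# Attach the predictions to a copy of the test set for inspection
results = test.assign(PredictedSalePrice=test_predictions)
print(results)
```

The key point is that the model only ever sees train during fit, and test goes through exactly the same column selection before predict. Dropping .head() from the call, as done here, gives predictions for every row of the test set instead of just the first five.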