csv pandas kaggle sklearn-pandas

ValueError in creating submission csv


I am learning data science by reading other people's scripts. One Titanic notebook on Kaggle applies logistic regression and then, according to the code, exports the predictions to a .csv file. However, it raises an error every time I run it. The original script is found here, and the .csv data that's being read into the code is here: train.csv test.csv

Input[24] through Input[28] set up LogisticRegression. The code runs without error up to Input[27]. When running Input[28]:

    acc_log = predict_model(X_data, Y_data, logreg, X_test_kaggle, 'submission_Logistic.csv')

I receive an error message:

    ValueError: could not convert string to float: 'Q'

I tried wrapping the call in try/except to bypass the error so the code can continue:

    try:
        acc_log = predict_model(X_data, Y_data, logreg, X_test_kaggle, 'submission_Logistic.csv')
    except ValueError:
        pass

This code is too sophisticated for me to debug: I can't tell which step goes wrong, or where in the data a string appears in place of the expected float. I would appreciate help understanding the error and finding a proper solution. Thanks.


Solution

  • It looks like you didn't run cell 16 of the notebook you linked, which converts the Embarked values to integers (including the string value Q, which is causing the error you're seeing):

    Cell 16

    # fill the missing values of the Embarked feature with the most common occurrence
    freq_port = train_df.Embarked.dropna().mode()[0]
    for dataset in combine:
        dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
    
    for dataset in combine:
        dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    train_df.head()
    

    I just ran all the cells in order and the LogisticRegression section worked fine for me. Try shutting down your notebook and re-running all the cells in the order they appear.

    A general data science tip:
    When you've already trained a model but your predict() function is throwing an error, it helps to look at the test data you're passing in and try to figure out what's wrong there.
    In this case, searching the values in X_test_kaggle for the string Q might have revealed that the problem was in the Embarked field, which could have served as a first breadcrumb for tracking the problem back to its source.
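
The tip above can be sketched in a few lines of pandas. This is only a minimal illustration and not part of the original notebook: the small X_test_kaggle frame below is a made-up stand-in for the real test data, and find_string_columns and columns_containing are hypothetical helper names introduced here.

```python
import pandas as pd

# Toy stand-in for X_test_kaggle (assumption); in the notebook, use the real frame.
X_test_kaggle = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", "Q"],  # still strings, as if cell 16 never ran
})


def find_string_columns(df):
    """Return the names of columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()


def columns_containing(df, value):
    """Return the names of columns in which `value` appears at least once."""
    return [col for col in df.columns if (df[col] == value).any()]


string_cols = find_string_columns(X_test_kaggle)
cols_with_q = columns_containing(X_test_kaggle, "Q")

print("Non-numeric columns:", string_cols)
print("Columns containing 'Q':", cols_with_q)
```

If either list is non-empty right before the model call, a preprocessing cell was probably skipped or run out of order. Checking this way is also more informative than wrapping the call in try/except with pass, which only hides the error and leaves the submission file unwritten.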