
Having trouble loading a numpy array into a different shape


#Start your code
#Hint - use pandas to read the Excel file data and then extract the data to a numpy array "data"
df = pd.read_excel('A3data.xlsx')
data = df[['Exam1', 'Exam2','Admission Decision']].to_numpy()

#End your code

print('shape of sample data:', data.shape) # Check if data is 100 by 3

The task: load data into X_train, a numpy array of shape (100,2), and y_train, of shape (100,1).

X_train = data[0, [0,2]] <---- my attempt
y_train = data[0, [0,2]]


#It is a good idea to visualize data on a scatter plot, if possible. Here we can.
x_class0 = X_train[y_train == 0] <-- this is where the error is occurring
x_class1 = X_train[y_train == 1]

# Create a scatter plot
plt.scatter(x_class0[:, 0], x_class0[:, 1], color='blue', label='Not Admitted')
plt.scatter(x_class1[:, 0], x_class1[:, 1], color='red', label='Admitted')

The error shown: too many indices for array: array is 1-d but 2 was indexed


Solution

  • If you want to index it like a numpy array, I think you'll need to use numpy slicing:

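    import numpy as np
    import pandas as pd

    all_data = pd.DataFrame(np.random.rand(10,3)).to_numpy()
    x = all_data[:, :2]   # all rows, first two columns -> shape (10, 2)
    y = all_data[:, 2:]   # all rows, last column only -> shape (10, 1)
    x, y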
    

    Your current code, data[0, [0,2]], returns only a single row (row 0, columns 0 and 2), not the full columns.

    However, that may cause problems down the road when you've lost your column headers. I'd suggest the following instead given your example:

    data = df[['Exam1', 'Exam2','Admission Decision']]
    
    x_train = data[['Exam1', 'Exam2']]
    y_train = data['Admission Decision']
    

    (But remember to do your train-test split before you separate into x and y)

    However, the issue you're having here:

    # Notice how these are the same array!
    X_train = data[0, [0,2]] <---- my attempt
    y_train = data[0, [0,2]]
    
    
    #It is a good idea to visualize data on a scatter plot, if possible. Here we can.
    x_class0 = X_train[y_train == 0] <-- this is where the error is occurring
    x_class1 = X_train[y_train == 1]
    

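    is because X_train and y_train are the same array as you've defined them: each is row 0, columns 0 and 2 of data, so it is 1-D with two values rather than a (100,2) or (100,1) array. Masking a 1-D array gives another 1-D array, and indexing that with two indices, as x_class0[:, 0] does in the scatter call, raises the "too many indices" error.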
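
To make the shapes concrete, here is a minimal sketch of the numpy-slicing route, ending with the class split the scatter plot needs. It is an illustration under assumptions, not the asker's actual data: the array is synthetic (random exam scores and 0/1 decisions standing in for A3data.xlsx), and the name labels is introduced here only for the flattened copy of y_train.

```python
import numpy as np

# Synthetic stand-in (assumption) for the (100, 3) array read from the Excel file:
# two exam scores followed by a 0/1 admission decision.
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.uniform(30, 100, size=100),   # Exam1
    rng.uniform(30, 100, size=100),   # Exam2
    rng.integers(0, 2, size=100),     # Admission Decision
])

X_train = data[:, :2]   # all rows, first two columns -> shape (100, 2)
y_train = data[:, 2:]   # all rows, last column kept 2-D -> shape (100, 1)

# A (100, 1) boolean mask cannot index the rows of a (100, 2) array directly,
# so flatten the labels to 1-D before building the mask.
labels = y_train.ravel()
x_class0 = X_train[labels == 0]   # rows with decision 0, two columns each
x_class1 = X_train[labels == 1]   # rows with decision 1, two columns each

print(X_train.shape, y_train.shape)
print(x_class0.shape, x_class1.shape)
```

Because x_class0 and x_class1 stay 2-D, slices such as x_class0[:, 0] and x_class0[:, 1] are valid and can be passed to plt.scatter as in the question.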