python machine-learning scikit-learn sklearn-pandas

How to create data for machine learning project


I am working on a machine learning project for which I am creating data about users. Each record consists of the user's age, years of experience, city, type of business, and whether they have a previous loan. The rules for the data are as follows:

  1. If a user has a good age, high experience, a good business, and no previous loan, the loan will be granted.

  2. If a user has a good age, low experience, a good business, and no previous loan, the loan will not be granted.

  3. If a user has a good age, high experience, a good business, and a previous loan, the loan will not be granted.

Following rules like these, I have created a CSV file that contains all of this data. The link to the CSV file is below:

https://drive.google.com/file/d/1zhKr8YR951Yp-_mC23hROy7AgJoRpF0m/view?usp=sharing

This file has columns for age, experience, city (coded as values from 2 to 9), type of business (coded as 7 or 8), previous loan (coded as 0), and the final output, YES (1) or NO (0).
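
For illustration, rules like the ones above could be encoded as a small labelling function and used to generate rows. This is only a sketch: the post never says what counts as a "good age", "high experience", or a "good business", so the thresholds GOOD_AGE, HIGH_EXPERIENCE, and GOOD_BUSINESS below are made-up assumptions, as is the use of 1 for "has a previous loan".

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds: the post does not define these terms.
GOOD_AGE = (25, 55)      # assumed inclusive range of "good" ages
HIGH_EXPERIENCE = 10     # assumed minimum years for "high" experience
GOOD_BUSINESS = {7}      # assumed code of a "good" business type


def loan_decision(age, experience, business, previous_loan):
    """Return 1 (loan granted) only when every condition holds, else 0."""
    good_age = GOOD_AGE[0] <= age <= GOOD_AGE[1]
    high_experience = experience >= HIGH_EXPERIENCE
    good_business = business in GOOD_BUSINESS
    no_previous_loan = previous_loan == 0
    return int(good_age and high_experience and good_business and no_previous_loan)


rng = np.random.default_rng(seed=0)
n_rows = 500
synthetic = pd.DataFrame({
    "AGE": rng.integers(18, 70, size=n_rows),
    "Experience": rng.integers(0, 30, size=n_rows),
    "City": rng.integers(2, 10, size=n_rows),          # codes 2-9, as in the post
    "Business": rng.integers(7, 9, size=n_rows),       # codes 7-8, as in the post
    "Previous Loan": rng.integers(0, 2, size=n_rows),  # 0 = none, 1 = has one (assumed)
})

# Label every row by applying the rules to its columns.
synthetic["Output"] = [
    loan_decision(a, e, b, p)
    for a, e, b, p in zip(synthetic["AGE"], synthetic["Experience"],
                          synthetic["Business"], synthetic["Previous Loan"])
]
print(synthetic.head())
```

Note that City plays no part in the label here, because none of the three rules quoted above mentions it.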

I am using the code below to train a model and predict whether a user will be granted a loan:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model


data = pd.read_csv("test.csv")
data.head()

X = data[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']]
Y = data["Output"]

train = data[:(int((len(data) * 0.8)))]
test = data[(int((len(data) * 0.8))):]

regr = linear_model.LinearRegression()
train_x = np.array(train[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
train_y = np.array(train["Output"])
regr.fit(train_x, train_y)
test_x = np.array(test[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
test_y = np.array(test["Output"])

coeff_data = pd.DataFrame(regr.coef_, X.columns, columns=["Coefficients"])
print(coeff_data)

# Now let's do prediction of data:
test_x2 = np.array([[41, 13, 9, 7, 0]])  # <- Here I am using some random values to test 
Y_pred = regr.predict(test_x2)

Running the above code, I get values of Y_pred such as 0.01543, 0.884, or sometimes 1.034. I do not understand what this output means. Initially I thought that 0.01543 might mean low confidence, so the loan would not be granted, and 0.884 high confidence, so the loan would be granted. Is that correct? Can anyone please help me understand it?

Can anyone please point me to some basic machine learning examples to get me started on this type of scenario? Thanks.


Solution

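  • No, that is not the right way to read it. Your output has to be either 1 or 0, so this is a classification problem, not a regression problem. Linear regression fits an unbounded linear function, which is why you can get values such as 1.034, and its outputs are not confidences. Use a classification algorithm such as logistic regression instead of linear regression.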

    clf = linear_model.LogisticRegression()
    train_x = np.array(train[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
    train_y = np.array(train["Output"])
    clf.fit(train_x, train_y)
    
    test_x = np.array(test[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
    test_y = np.array(test["Output"])
    
    test_x2 = np.array([[41, 13, 9, 7, 0]])
    Y_pred = clf.predict(test_x2)
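
If what you want is the "confidence" the question asks about, scikit-learn's LogisticRegression also offers predict_proba, which returns a probability for each class. Here is a minimal sketch of the difference between predict and predict_proba. The rows in toy_x and toy_y are invented stand-ins for test.csv, so the numbers it prints say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented rows standing in for test.csv (not the real data).
# Columns: AGE, Experience, City, Business, Previous Loan
toy_x = np.array([
    [41, 13, 9, 7, 0],
    [35, 12, 4, 7, 0],
    [50, 20, 3, 8, 0],
    [45, 15, 6, 7, 0],
    [30,  1, 5, 7, 0],
    [28,  2, 2, 8, 0],
    [33,  3, 8, 7, 0],
    [26,  0, 9, 8, 0],
])
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression(max_iter=1000)
clf.fit(toy_x, toy_y)

test_x2 = np.array([[41, 13, 9, 7, 0]])
label = clf.predict(test_x2)        # hard decision: an array of 0s and 1s
proba = clf.predict_proba(test_x2)  # one row per sample: [P(class 0), P(class 1)]

print("predicted label:", label[0])
print("P(loan granted):", proba[0, 1])
```

Unlike the LinearRegression output, each row of proba always lies between 0 and 1 and sums to 1, so proba[0, 1] can be read as the model's estimated probability of a YES.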
    

    You can also delete the coeff_data line, because it is not needed for prediction. If you want to check the coefficients, read them directly from the fitted model:

    clf.coef_
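
If you would still like the labelled table from the question, keep in mind that for a two-class problem clf.coef_ has shape (1, n_features), so you need its first row rather than the whole array. A small sketch, again on invented rows rather than the real test.csv:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ['AGE', 'Experience', 'City', 'Business', 'Previous Loan']

# Invented rows standing in for test.csv (not the real data).
toy = pd.DataFrame(
    [
        [41, 13, 9, 7, 0, 1],
        [35, 12, 4, 7, 0, 1],
        [50, 20, 3, 8, 0, 1],
        [45, 15, 6, 7, 0, 1],
        [30,  1, 5, 7, 0, 0],
        [28,  2, 2, 8, 0, 0],
        [33,  3, 8, 7, 0, 0],
        [26,  0, 9, 8, 0, 0],
    ],
    columns=feature_names + ['Output'],
)

clf = LogisticRegression(max_iter=1000)
clf.fit(toy[feature_names], toy['Output'])

# Binary case: coef_ is a (1, n_features) array, so take row 0.
coeff_data = pd.DataFrame({'Coefficients': clf.coef_[0]}, index=feature_names)
print(coeff_data)
print('Intercept:', clf.intercept_[0])
```

These coefficients act on the log-odds of a YES, not on the output directly, and their sizes depend on the scale of each column, so compare them with care.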
    

    Check this link, it has a great explanation of loan approval with Machine Learning
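
One more thing: the code in the question builds test_x and test_y but never uses them. A common way to use a held-out set is to score the classifier's predictions against the true labels. Below is a rough sketch using train_test_split and accuracy_score from scikit-learn. The data is randomly generated, and the labelling rule (10 or more years of experience gets the loan) is an assumption made only so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
n_rows = 400
age = rng.integers(18, 70, size=n_rows)
experience = rng.integers(0, 30, size=n_rows)
city = rng.integers(2, 10, size=n_rows)      # codes 2-9
business = rng.integers(7, 9, size=n_rows)   # codes 7-8
previous_loan = np.zeros(n_rows, dtype=int)  # always 0, as described in the question

X = np.column_stack([age, experience, city, business, previous_loan])
# Assumed labelling rule for this toy data only.
y = (experience >= 10).astype(int)

# Shuffled 80/20 split that keeps the YES/NO ratio similar in both parts.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_x, train_y)

test_pred = clf.predict(test_x)
accuracy = accuracy_score(test_y, test_pred)
print("held-out accuracy:", accuracy)
```

The slicing in the question (first 80% of rows for training, last 20% for testing) works too, but only if the CSV rows are not sorted by output; a shuffled split avoids that pitfall.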