I am reading a CSV file and, for modeling purposes, need to create a target (Y) and X variables, but I am not sure how to set that up. I am new to coding and need some guidance that I couldn't work out from the pandas docs. I would like the target to be 'Bad Indicator' and X to be all other columns.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
project = pd.read_csv('c:/users/Brandon Thomas/Project.csv')
project=pd.DataFrame(project)
df = pd.DataFrame(project.data, columns = project.feature_names)
df["Bad Indicator"] = x.target
X = df.drop("Bad Indicator",axis=1) #Feature Matrix
y = df["Bad Indicator"] #Target Variable
df.head()
AttributeError Traceback (most recent call last)
in
1 # Build dataframe
----> 2 df = pd.DataFrame(project.data, columns = project.feature_names)
3 df["Bad Indicator"] = x.target
4 X = df.drop("Bad Indicator",axis=1) #Feature Matrix
5 y = df["Bad Indicator"] #Target Variable
~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'data'
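In your code above you create a dataframe three separate times: once with pd.read_csv, once with project = pd.DataFrame(project), and once more with df = pd.DataFrame(...). pd.read_csv already returns a DataFrame, so the two extra calls are unnecessary. The AttributeError comes from project.data: a DataFrame has no .data (or .feature_names) attribute. That pattern is used with scikit-learn's bundled dataset objects, not with a DataFrame read from a CSV.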
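To check this yourself, here is a minimal sketch that uses a small in-memory CSV in place of Project.csv. The columns Age and Income and all the values are made up for illustration; only 'Bad Indicator' comes from the question.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for Project.csv
csv_text = "Bad Indicator,Age,Income\n0,34,52000\n1,45,31000\n"

# read_csv already returns a DataFrame
project = pd.read_csv(io.StringIO(csv_text))
print(type(project))

# Wrapping it in pd.DataFrame again adds nothing: the contents are identical
rewrapped = pd.DataFrame(project)
print(rewrapped.equals(project))

# A DataFrame has no .data attribute (unless a column happens to be named "data"),
# which is why project.data raises AttributeError
print(hasattr(project, "data"))
```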
I have taken out currently unnecessary imports such as numpy, scipy, and matplotlib. You can add them back if you need them later. To set up Y and X, all you need to do is:
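import pandas as pd
# read_csv returns a DataFrame; columns are named automatically if your csv has headers
df = pd.read_csv('c:/users/Brandon Thomas/Project.csv')
# if your csv does not have headers, also pass header=None to pd.read_csv
# (otherwise the first data row is used as the header), then name the columns yourself,
# replacing the ... with your remaining column names:
df.columns = ['Bad Indicator', 'ColumnName1', 'ColumnName2', ...]
X = df.drop("Bad Indicator", axis=1)  # Feature matrix: every column except the target
Y = df["Bad Indicator"]  # Target variable
df.head()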
If your csv does have headers, remove the df.columns line.
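For a CSV without a header row, the same split might look like the sketch below. The data and the column names Age and Income are hypothetical. header=None tells read_csv not to treat the first row as column names.

```python
import io
import pandas as pd

# Hypothetical header-less CSV: the first column plays the role of 'Bad Indicator'
csv_text = "0,34,52000\n1,45,31000\n0,29,48000\n"

# header=None keeps the first row as data instead of using it as column names
df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = ["Bad Indicator", "Age", "Income"]  # assumed names for illustration

X = df.drop(columns="Bad Indicator")  # feature matrix: all columns except the target
Y = df["Bad Indicator"]               # target variable

print(X.columns.tolist())
print(Y.tolist())
```

Passing names=[...] to pd.read_csv is another way to assign the column names in the same call.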