I am new to Python and currently doing a project in it. I have audio and lyrical data of songs. While training the model with audio features, scaling worked fine, but when I use it for the lyrical (i.e. textual) data it gives this error. I have converted the textual data to numerical form using CountVectorizer. This is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

lyr = pd.read_csv('ly.csv', encoding="ISO-8859-1")
X = lyr.lyrics
y = lyr.terms
text_train, text_test, y_train, y_test = train_test_split(X, y)
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)
# compute the mean value per feature on the training set
mean_on_train = X_train.mean(axis=0)
# compute the standard deviation of each feature on the training set
std_on_train = X_train.std(axis=0)
# afterwards, mean=0 and std=1
X_train_scaled = (X_train - mean_on_train) / std_on_train
X_test_scaled = (X_test - mean_on_train) / std_on_train
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)
print("accuracy on training set: %f" % mlp.score(X_train_scaled, y_train))
print("accuracy on test set: %f" % mlp.score(X_test_scaled, y_test))
And this is the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-d65d865b4b90> in <module>()
3 mean_on_train = X_train.mean(axis=0)
4 # compute the standard deviation of each feature on the training set
----> 5 std_on_train = X_train.std(axis=0)
6 # afterwards, mean=0 and std=1
7 X_train_scaled = (X_train - mean_on_train) / std_on_train
C:\ProgramData\Anaconda3\lib\site-packages\scipy\sparse\base.py in __getattr__(self, attr)
574 return self.getnnz()
575 else:
--> 576 raise AttributeError(attr + " not found")
577
578 def transpose(self, axes=None, copy=False):
AttributeError: std not found
Regards
You're having two separate problems.
First, you're calling a method on an object that doesn't support it. If you look at the docs for scipy.sparse.csr_matrix, you'll see it has no 'std' method (it does have mean, maximum, and a few others). I linked to the current version of scipy; I'm not sure which version you're using. Why isn't std built in? Possibly because it can't always be computed efficiently or safely on a sparse matrix, or the developers simply haven't added it yet. Any number of possibilities, but your clue was:
AttributeError: std not found
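If you do need per-feature standard deviations from a sparse matrix without converting it to dense, one workaround is the identity std² = E[x²] − E[x]², which only needs the mean and power methods that sparse matrices do provide. This is a sketch on a small made-up matrix, not part of the scipy API:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small made-up matrix standing in for your CountVectorizer output
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 4.0],
                         [5.0, 0.0, 6.0]]))

# std = sqrt(E[x^2] - E[x]^2), computed column-wise without densifying X
mean = np.asarray(X.mean(axis=0)).ravel()
mean_sq = np.asarray(X.power(2).mean(axis=0)).ravel()
std = np.sqrt(mean_sq - mean**2)

print(std)  # matches np.std(X.toarray(), axis=0)
```

Note this is the population standard deviation (ddof=0), the same as numpy's default.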
Second, regarding your comment above about whether it's "alright" to just take X_train/mean: that depends on what you want. If you divide a series of numbers by their mean, you're essentially expressing each value as a fraction of the mean. If you want standardized values (which you usually do for machine learning), then you really do want the standard deviation. I'll show you an example using numpy on a simple array.
>>> import numpy as np
>>> x = [2,3,3,4,4,4,4,5,5]
>>> np.std(x)
0.9162456945817024
>>> np.mean(x)
3.7777777777777777
Calculating the standard deviation is straightforward. I broke it down into three steps: the squared differences, the variance, and then the square root of the variance, which is the actual standard deviation:
>>> (x - np.mean(x))**2
array([3.16049383, 0.60493827, 0.60493827, 0.04938272, 0.04938272,
       0.04938272, 0.04938272, 1.49382716, 1.49382716])
>>> np.sum(((x-np.mean(x))**2))/len(x)
0.8395061728395062
>>> np.sqrt(np.sum(((x-np.mean(x))**2))/len(x))
0.9162456945817024
So, you can use that to generate what you were after:
>>> (x - np.mean(x))/0.916245  # could also use np.std(x)
array([-1.94028647, -0.84887533, -0.84887533,  0.24253581,  0.24253581,
        0.24253581,  0.24253581,  1.33394695,  1.33394695])
Compare that to the values you get by just dividing by the mean (2 is about 53% of 3.7778, and so on); it's not the same thing:
>>> x/np.mean(x)
array([0.52941176, 0.79411765, 0.79411765, 1.05882353, 1.05882353,
1.05882353, 1.05882353, 1.32352941, 1.32352941])
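Finally, for your actual pipeline, you don't need to hand-roll the scaling at all. A minimal sketch using scikit-learn's StandardScaler with with_mean=False, which is the variant that works on sparse input (subtracting the mean would destroy sparsity, so only the scaling by std is applied); the document strings here are made-up stand-ins for your text_train:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for your text_train
docs_train = ["some example lyrics here", "more example lyric text here"]

# Sparse count matrix, like your X_train
X_train = CountVectorizer().fit_transform(docs_train)

# with_mean=False: divide each feature by its std, but skip mean
# subtraction so the matrix stays sparse
scaler = StandardScaler(with_mean=False).fit(X_train)
X_train_scaled = scaler.transform(X_train)  # still sparse
```

Fit the scaler on the training matrix only, then call transform on both the train and test matrices, just like you did with the vectorizer.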