python machine-learning scikit-learn pca feature-detection

Error with PCA and Randomized Lasso


I have two .csv files containing Tweets, each with a classification: pos, neg or neutral. The class column holds the classification and the text column holds the Tweet.

This is my code:

def prediction():
    print("Reading files...")

    #Will learn from this data set.
    train = file2SentencesArray('twitter-sanders-apple3')

    #Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Complete!")

    print("Cleaning sentences...")
    #cleanSentences will remove HTML, stop words and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Complete!...")

    print("Fitting sentences...")
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)

    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    #Getting error here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    #and here.
    #pca = decomposition.PCA(n_components=2)
    #pca.fit_transform(trainDataFeatures)
    #trainDataFeatures = pca.transform(trainDataFeatures)
    print("Complete!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Complete...")

    return result

The Randomized lasso and the PCA are both throwing exceptions:

PCA – PCA does not support sparse input.

Randomized lasso – bad input shape

My trainDataFeatures looks like this:

(0, 573)   1
(0, 1411)  2
(0, 2748)  1
(0, 1073)  1
(1, 126)   1
(2, 1203)  1

Solution

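  • The input format for both PCA and Randomized Lasso is not correct. CountVectorizer returns a SciPy sparse matrix, and np.asarray does not turn it into a dense array (its return value is also discarded in the code above), so PCA still receives sparse input. For Randomized Lasso, note as well that the second argument of fit_transform is expected to be the training targets, so passing testDataFeatures there is the likely cause of the bad input shape error. Please replace the following two lines and try again.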

    np.asarray(trainDataFeatures)
    np.asarray(testDataFeatures)
    # replace the above two lines with these
    trainDataFeatures = trainDataFeatures.toarray()
    testDataFeatures = testDataFeatures.toarray()
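
To illustrate the fix end to end, here is a minimal sketch of the dense conversion followed by PCA and the random forest. It is not the asker's full pipeline: the sentences and labels are made-up stand-ins for the .csv data, the snake_case variable names are hypothetical, and random_state is added only to make the run repeatable. The point it shows is that PCA is fitted on the training features only and the same fitted projection is then applied to the test features, so both sets end up with the same number of columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in data; the original code reads these from the .csv files.
train_sentences = [
    "love the new phone",
    "hate the battery life",
    "the phone is okay",
    "great screen love it",
    "terrible support hate it",
    "it is a phone",
]
train_labels = ["pos", "neg", "neutral", "pos", "neg", "neutral"]
test_sentences = ["love the screen", "hate the support"]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)

# fit_transform/transform return SciPy sparse matrices;
# .toarray() converts them to dense NumPy arrays.
train_features = vectorizer.fit_transform(train_sentences).toarray()
test_features = vectorizer.transform(test_sentences).toarray()

# Fit PCA on the training data only, then reuse the fitted projection on the test data.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_features)
test_reduced = pca.transform(test_features)

# Train on the reduced training features and predict on the reduced test features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_reduced, train_labels)
result = forest.predict(test_reduced)

print(train_reduced.shape, test_reduced.shape, result)
```

Note that .toarray() materializes the whole matrix in memory, which is fine for a few thousand Tweets and 5000 features but may not scale to much larger corpora. In that case sklearn.decomposition.TruncatedSVD, which accepts sparse input directly, is a commonly used alternative to PCA.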