machine-learning nlp weka

How to use string data for SVM (SMO) in Weka


I have an ARFF file containing sentences (in Persian). In the @data section, each sentence is followed by a word that gives its class. I need to use SMO for classification. My questions:

1) Is it necessary to convert the sentences to vectors?

2) I applied the "string to word vector" filter, but SMO is still disabled and cannot be run (the same goes for other algorithms such as naive Bayes).

How can I use this text data with SMO?

enter image description here

The above picture is a very small sample file.

file sample: https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0



Solution

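  • First, apply the "string to word vector" filter. Then, on the Classify tab, change the target class to "(Nom) class". After filtering, the class attribute is typically no longer the last attribute, and the last attribute is what the Classify tab selects by default, so the classifiers stay disabled until you pick the nominal class yourself. This is enough to enable the naive Bayes and SVM algorithms. I downloaded the dataset, and it worked well.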

    You can follow this tutorial: https://www.youtube.com/watch?v=zlVJ2_N_Olo

    Hope this helps.

    # Assumes the `arff` package from PyPI, whose load() yields one row per data line.
    import arff
    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    data = list(arff.load('shoor.arff'))

    # Column 0 holds the sentence, column 1 its class label.
    text = []
    label = []
    for r in data:
        if len(r) > 1:
            text.append(r[0])
            label.append(r[1])

    # Build TF-IDF vectors, then use each sentence's similarity
    # to every other sentence as its feature vector.
    tfidf = TfidfVectorizer().fit_transform(text)
    features = (tfidf @ tfidf.T).toarray()

    X_train, X_test, y_train, y_test = train_test_split(
        features, label, test_size=0.5, random_state=0)
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print(clf.score(X_test, y_test))

    1.0
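
On the first question (whether the sentences must become vectors): SMO and naive Bayes work on numeric or nominal attributes, not raw strings, which is why a string-to-vector step is needed. The sketch below is a simplified illustration of that idea using only the Python standard library. It is not Weka's implementation; the function names and the toy sentences are hypothetical, and the real filter offers its own tokenization and weighting options.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            # First time a token is seen, give it the next free column.
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(sentence, vocab):
    """Return one word-count vector (a list of ints) for a sentence."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = n
    return vector


# Hypothetical toy sentences standing in for the rows of the ARFF file.
sentences = ["good good film", "bad film"]
vocab = build_vocabulary(sentences)
vectors = [to_count_vector(s, vocab) for s in sentences]
```

Each sentence becomes a fixed-length row of numbers, one column per word, and the class label stays a separate nominal column. That table is the kind of input a classifier such as SMO can train on.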