Search code examples
machine-learningnlpsentiment-analysislightgbm

What are the best Pre-Processing techniques for Sentiment Analysis.?


I am trying to classify a dataset of reviews in to two classes say class A and class B. I am using LightGBM to classify.

I have changed the parameters for the classifier many times but I can't get a huge difference in the results.

I think the problem is with the pre-processing step. I defined a function as shown below to take care of pre-processing. I used Stemming and removed stopwords. I don't know what I am missing. I have tried LancasterStemmer and PorterStemmer

stops = set(stopwords.words("english"))
def cleanData(text, lowercase = False, remove_stops = False, stemming = False, lemm = False):
    txt = str(text)
    txt = re.sub(r'[^A-Za-z0-9\s]',r'',txt)
    txt = re.sub(r'\n',r' ',txt)

    if lowercase:
        txt = " ".join([w.lower() for w in txt.split()])

    if remove_stops:
        txt = " ".join([w for w in txt.split() if w not in stops])

    if stemming:
        st = PorterStemmer()
        txt = " ".join([st.stem(w) for w in txt.split()])

    if lemm:
        wordnet_lemmatizer = WordNetLemmatizer()
        txt = " ".join([wordnet_lemmatizer.lemmatize(w) for w in txt.split()])
    return txt

Are there any more pre-processing steps to be done to get a better accuracy.?

URL for the dataset : Dataset

EDIT :

Parameters that I used are as mentioned below.

params = {'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'learning_rate': 0.01, 
    'max_depth': 22, 
    'num_leaves': 78,
    'feature_fraction': 0.1, 
    'bagging_fraction': 0.4, 
    'bagging_freq': 1}

I have altered the depth and num_leaves parameters along with others. But the accuracy is kind of stuck at a certain level..


Solution

  • There are a few things to consider. First of all your training set is not balanced - the class distribution is ~ 70%/30%. You need to consider this fact in training. What types of features are you using? Using the right set of features could improve your performance.