Tags: python, machine-learning, classification, decision-tree, sklearn-pandas

Text Classification using Decision Trees in Python


I am new to Python as well as machine learning. My implementation is based on the IEEE research paper "Bug report, feature request, or simply praise? On automatically classifying app reviews" (http://ieeexplore.ieee.org/document/7320414/).

I want to classify text into categories. The text is user reviews from the Google Play Store or Apple App Store. The categories used in the research were Bug, Feature, User Experience, and Rating. Given this situation, I am trying to implement a decision tree using the sklearn package in Python. I came across the example 'iris' dataset provided by sklearn, which builds a tree model from the features and their values mapped to the target. In that example the data is numeric.

I am trying to classify text instead of numeric data. Examples:

  1. I liked very much the upgrade to pdfs. However, they aren't displaying anymore Fix it and it will be perfect [BUG]
  2. I just wish it would notify me if I go below a certain dollar amount [FEATURE]
  3. This app is very helpful in my line of business [Rating]
  4. Easy to find songs and purchase in iTunes [UserExperience]

Given these texts and many more user reviews in these categories, I want to create a classifier that can train on the data and predict the target of any given user review.

So far I have pre-processed the text and created training data in the form of a list of tuples, each containing the pre-processed data and its target.

My Pre-processing:

  1. Tokenize multi-sentence comments into single sentences
  2. Tokenize each sentence into words
  3. Remove stop words in the tokenized sentence
  4. Lemmatize the words in the tokenized sentence

(['i', 'liked', 'much', 'upgrade', 'pdfs', 'however', 'displaying', 'anymore', 'fix', 'perfect'], "BUG")

Here's what I have so far:

import json
from sklearn import tree
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, RegexpTokenizer

# define a tokenizer to tokenize sentences and also remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

# this list stores all the training data along with its labels
tagged_tokenized_comments_corpus = []


# Method: to add data to training set
# Parameter: Tuple in the format (Data, Label)
def tag_tokenized_comments_corpus(*tuple_data):
    tagged_tokenized_comments_corpus.append(tuple_data)


# step 1: Load all the stop words from the nltk package
stop_words = stopwords.words("english")
stop_words.remove('not')

# creating a temporary list to copy the existing stop words
temp_stop_words = stop_words

for word in temp_stop_words:
if "n't" in word:
    stop_words.remove(word)

# load the data set
files = ["Bug.txt", "Feature.txt", "Rating.txt", "UserExperience.txt"]

d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}

for file in files:
    input_file = open(file, "r")
    file_text = input_file.read()
    json_content = json.loads(file_text)

    # step 3: Tokenize multi-sentence comments into single sentences from the user comments
    comments_corpus = []
    for i in range(len(json_content)):
        comments = json_content[i]['comment']
        if len(sent_tokenize(comments)) > 1:
            for comment in sent_tokenize(comments):
                comments_corpus.append(comment)
        else:
            comments_corpus.append(comments)

    # step 4: Tokenize each sentence, remove stop words and lemmatize the comments corpus
    lemmatizer = WordNetLemmatizer()
    tokenized_comments_corpus = []
    for i in range(len(comments_corpus)):
        words = tokenizer.tokenize(comments_corpus[i])
        tokenized_sentence = []
        for w in words:
            if w not in stop_words:
                tokenized_sentence.append(lemmatizer.lemmatize(w.lower()))
        if tokenized_sentence:
            tokenized_comments_corpus.append(tokenized_sentence)
            tag_tokenized_comments_corpus(tokenized_sentence, d[input_file.name.split(".")[0]])

# step 5: Create a dictionary of words from the tokenized comments corpus
unique_words = []
for sentence in tagged_tokenized_comments_corpus:
    for word in sentence[0]:
        unique_words.append(word)
unique_words = set(unique_words)

dictionary = {}
i = 0
for dict_word in unique_words:
    dictionary.update({i: dict_word})
    i = i + 1


train_target = []
train_data = []
for sentence in tagged_tokenized_comments_corpus:
    train_target.append(sentence[0])
    train_data.append(sentence[1])

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

test_data = "Beautiful Keep it up.. this far is the most usable app editor.. 
it makes my photos more beautiful and alive.."

test_words = tokenizer.tokenize(test_data)
test_tokenized_sentence = []
for test_word in test_words:
    if test_word not in stop_words:
        test_tokenized_sentence.append(lemmatizer.lemmatize(test_word.lower()))

#predict using the classifier
print("predicting the labels: ")
print(clf.predict(test_tokenized_sentence))

However, this doesn't work: it throws an error at runtime when the classifier is trained. I was thinking that I could map the words in each tuple to the dictionary, convert the text into numeric form, and train the algorithm on that, but I am not sure whether this would work.
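
Something along the lines of this rough sketch is what I have in mind (the word_to_index mapping and the to_vector helper are just illustrative names, not code I have written yet):

# rough idea: give every word in the vocabulary a fixed index and turn each
# tokenized comment into a 0/1 vector of that length
vocabulary = sorted(unique_words)
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def to_vector(tokenized_sentence):
    vector = [0] * len(vocabulary)
    for word in tokenized_sentence:
        if word in word_to_index:
            vector[word_to_index[word]] = 1
    return vector

numeric_train_data = [to_vector(tokens) for tokens, label in tagged_tokenized_comments_corpus]
numeric_train_target = [label for tokens, label in tagged_tokenized_comments_corpus]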

Can anyone suggest how I can fix this code, or whether there is a better way to implement this decision tree? The traceback is below.

Traceback (most recent call last):
  File "C:/Users/venka/Documents/GitHub/RE-18/Test.py", line 87, in <module>
    clf.fit(train_data, train_target)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.  0.  0. ...,  3.  3.  3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Solution

  • Decision trees can only work when your feature vectors are all the same length. Personally, I've got no clue how effective decision trees would be at text analysis like this, but if you're going to try it, the way I'd suggest is a "one-hot", "bag of words" style vector.

    Essentially, keep track of how many times each word appears in your example, and put those counts in a vector that represents the whole corpus. Say, once you've removed all the stop words, the vocabulary of the entire corpus is:

    {"Apple", "Banana", "Cherry", "Date", "Eggplant"}
    

    You represent each example by a vector the same size as that vocabulary, with each value indicating whether or not the corresponding word appears. In our example, that is a length-5 vector where the first element is associated with "Apple", the second with "Banana", and so on. You might get something like:

    bag("Apple Banana Date")
    #: [1, 1, 0, 1, 0]
    bag("Cherry")
    #: [0, 0, 1, 0, 0]
    bag("Date Eggplant Banana Banana")
    #: [0, 1, 0, 1, 1]
    # For this case, I have no clue if Banana having the value 2 would improve results.
    # It might. It might not. Something you'd need to test.
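
    A minimal sketch of what such a bag() function could look like, assuming the toy five-word vocabulary above (a real one would be built from the words in your review corpus):

    vocabulary = ["Apple", "Banana", "Cherry", "Date", "Eggplant"]

    def bag(text):
        # one-hot variant: only presence/absence matters, repeated words are ignored
        words = set(text.split())
        return [1 if word in words else 0 for word in vocabulary]

    print(bag("Apple Banana Date"))            # [1, 1, 0, 1, 0]
    print(bag("Date Eggplant Banana Banana"))  # [0, 1, 0, 1, 1]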
    

    This way, you have the same-sized vector regardless of the input, and the decision tree knows where to look for certain outputs. Say "Banana" corresponds strongly to bug reports; in that case the decision tree will know that a 1 in the second element means a bug report is more likely.
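
    Concretely, once every comment has been run through a bag() built over the real review vocabulary, the vectors and the integer labels can go straight into the classifier. A sketch, reusing tagged_tokenized_comments_corpus from the question (and assuming bag() was rebuilt over that corpus rather than the toy vocabulary above):

    from sklearn import tree

    train_data = [bag(" ".join(tokens)) for tokens, label in tagged_tokenized_comments_corpus]
    train_target = [label for tokens, label in tagged_tokenized_comments_corpus]

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_target)

    # a new review has to be vectorised the same way before predicting
    print(clf.predict([bag("fix pdfs displaying anymore")]))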

    Of course, your corpus might be thousands of words long. In that case, a decision tree probably won't be the best tool for the job, not unless you first take some time to trim down your features.
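
    If you do want to go down that road, one option (not from the original post) is to let scikit-learn's CountVectorizer build the binary bag-of-words matrix and cap the vocabulary size, which trims the features before the tree ever sees them. A sketch with two made-up reviews:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    comments = [
        "I liked the upgrade to pdfs, however they aren't displaying anymore",
        "I just wish it would notify me if I go below a certain dollar amount",
    ]
    labels = [0, 1]  # e.g. Bug and Feature from the question's mapping

    # binary=True gives one-hot style values; max_features keeps only the
    # most frequent words, trimming the feature space for the tree
    vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=2000)
    X = vectorizer.fit_transform(comments)

    clf = DecisionTreeClassifier()
    clf.fit(X, labels)

    print(clf.predict(vectorizer.transform(["pdfs stopped displaying"])))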