Tags: python, machine-learning, nlp, data-analysis

How to use ML algorithms with feature vector data from bag of words?


I'm making a program that predicts the corresponding business unit for a given piece of text. I've set up a vocabulary to find word occurrences in the text that correspond to a certain unit, but I'm unsure how to use that data with machine learning models to make predictions.

There are four units that it can possibly predict: MicrosoftTech, JavaTech, Pythoneers and JavascriptRoots. In the vocabulary I have put words that indicate certain units. For example, JavaTech: Java, Spring, Android; MicrosoftTech: .Net, csharp; and so on. I'm now using the bag-of-words model with my custom vocabulary to count how often those words occur.

This is my code for getting the word count data:

def bagOfWords(description, vocabulary):
    # Count how often each vocabulary word occurs in the tokenized description
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag

So let's say the vocabulary is: [java, spring, .net, csharp, python, numpy, nodejs, javascript]. And the description is: "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."

Running the code would output the following: Bag: [3,1,0,0,0,0,0,0]
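(The description is tokenized and lowercased before counting, as in the full code below, so the call looks like this:)

from nltk.tokenize import TweetTokenizer
import numpy as np

vocabulary = ["java", "spring", ".net", "csharp", "python", "numpy", "nodejs", "javascript"]
description = ("Company X is looking for a Java Developer. Requirements: Has worked "
               "with Java. 3+ years experience with Java, Maven and Spring.")

tokens = TweetTokenizer().tokenize(description.lower())  # lowercase, then split into tokens
bagOfWords(tokens, vocabulary)  # prints Bag: [3 1 0 0 0 0 0 0]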

How do I use this data for my predictions with ML algorithms?

My code so far:

import pandas as pd
import numpy as np
import warnings
import tkinter as tk
from tkinter import filedialog
from nltk.tokenize import TweetTokenizer

warnings.filterwarnings("ignore", category=FutureWarning)
root= tk.Tk()

canvas1 = tk.Canvas(root, width = 300, height = 300, bg = 'lightsteelblue')
canvas1.pack()

def getExcel():
    global df
    # Read the custom vocabulary and its unit labels from the Excel sheet
    vocabularysheet = pd.read_excel(r'Filepath\filename.xlsx')
    vocabulary = vocabularysheet['Word'].tolist()
    unitlabels = vocabularysheet['Unit'].tolist()

    # Read the descriptions to classify and strip layout characters
    import_file_path = filedialog.askopenfilename()
    testdatasheet = pd.read_excel(import_file_path)
    descriptiondf = pd.DataFrame(testdatasheet, columns=['Description'])
    descriptiondf = (descriptiondf.replace('\n', ' ', regex=True)
                                  .replace('\xa0', ' ', regex=True)
                                  .replace('•', ' ', regex=True)
                                  .replace('\u200b', ' ', regex=True))
    description = descriptiondf.values.tolist()

    # Turn every tokenized description into a bag-of-words count vector
    tokenized_description = tokenize(description)
    for index, tokens in enumerate(tokenized_description):
        tokenized_description[index] = bagOfWords(tokens, vocabulary)

def tokenize(description):
    # Lowercase each description and split it into word tokens
    tknzr = TweetTokenizer()
    for index, d in enumerate(description):
        description[index] = tknzr.tokenize(str(d).lower())
    return description

def wordFilter(token):
    # Return False for punctuation tokens so they can be filtered out
    bad_chars = [';', ':', '!', '*', ']', '[', '.', ',', "'", '"']
    return token not in bad_chars
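# NOTE: wordFilter is defined but never called; to drop punctuation tokens you
# could apply it in tokenize(), e.g.:
#     description[index] = list(filter(wordFilter, tknzr.tokenize(str(d).lower())))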

def bagOfWords(description, vocabulary):
    # Count how often each vocabulary word occurs in the tokenized description
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i, word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag


browseButton_Excel = tk.Button(text='Import Excel File', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
predictionButton = tk.Button(text='Button', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
canvas1.create_window(150, 150, window=browseButton_Excel)
canvas1.create_window(150, 200, window=predictionButton)  # place the second button so it is visible

root.mainloop()

Solution

  • You already know how to prepare a data set for training.

    Here is an example to illustrate:

    voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]
    
    units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
    desc1 = "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."
    desc2 = "Company Y is looking for a csharp Developer. Requirements: Has worked with csharp. 5+ years experience with csharp, Net."
    
    from nltk.tokenize import TweetTokenizer

    x_train = []
    y_train = []

    # bagOfWords expects tokens, so tokenize and lowercase the raw strings first
    tknzr = TweetTokenizer()
    x_train.append(bagOfWords(tknzr.tokenize(desc1.lower()), voca))
    y_train.append(units.index("JavaTech"))
    x_train.append(bagOfWords(tknzr.tokenize(desc2.lower()), voca))
    y_train.append(units.index("MicrosoftTech"))
    

    Now we have 2 training samples (features and labels):

    [array([3, 1, 0, 0, 0, 0, 0, 0]), array([0, 0, 1, 3, 0, 0, 0, 0])] [1, 0]
    
    array([3, 1, 0, 0, 0, 0, 0, 0]) => 1 (It means JavaTech)
    array([0, 0, 1, 3, 0, 0, 0, 0]) => 0 (It means MicrosoftTech)
    

    The model needs to predict one of the 4 units you defined, so we need a classification model. A multi-class classification model uses 'softmax' as the activation function of the output layer and a 'crossentropy' loss function. Here is a very simple deep-learning model written with the Keras API of TensorFlow.

    import tensorflow as tf
    import numpy as np

    units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
    # 8 hand-made bag-of-words vectors, 2 per unit
    x_train = np.array([[3, 1, 0, 0, 0, 0, 0, 0],
                        [1, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 1, 1, 0, 0, 0, 0],
                        [0, 0, 2, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 2, 1, 0, 0],
                        [0, 0, 0, 0, 1, 2, 0, 0],
                        [0, 0, 0, 0, 0, 0, 1, 1],
                        [0, 0, 0, 0, 0, 0, 1, 0]])
    # Integer class labels: the index of each sample's unit in the units list
    y_train = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    

    The model consists of one hidden layer with 256 neurons and an output layer with 4 neurons (one per unit):

    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(256, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
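
    Note: 'sparse_categorical_crossentropy' matches the integer labels in y_train. If you one-hot encode the labels instead, switch to 'categorical_crossentropy'; a minimal sketch of that variant:

    y_onehot = tf.keras.utils.to_categorical(y_train, num_classes=len(units))  # integer -> one-hot
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])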
    

    I set epochs to 50. Watch the loss and accuracy while it runs; 10 epochs was not enough. Now start training:

    model.fit(x_train, y_train, epochs=50)
    

    Here is the prediction part. newSample is just a sample I made up:

    newSample = np.array([[2, 2, 0, 0, 0, 0, 0, 0]])
    prediction = model.predict(newSample)
    print (prediction)
    print (units[np.argmax(prediction)])
    

    Finally, I got the result below:

    [[0.96280855 0.00981709 0.0102595  0.01711495]]
    MicrosoftTech
    

    These are the predicted probabilities for each unit, and the highest one is MicrosoftTech:

    MicrosoftTech : 0.96280855
    JavaTech : 0.00981709
    ....
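
    If you would rather use a classical ML algorithm than a neural network, the same x_train / y_train arrays work directly with scikit-learn classifiers. A minimal sketch, assuming scikit-learn is installed:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(max_iter=1000)   # simple linear classifier on the count vectors
    clf.fit(x_train, y_train)

    newSample = np.array([[2, 2, 0, 0, 0, 0, 0, 0]])
    print(units[clf.predict(newSample)[0]])   # most likely unit
    print(clf.predict_proba(newSample))       # probability of each unit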
    

    And here is the training log. You can see that the loss decreases consistently, which is why I increased the number of epochs:

    Epoch 1/50
    8/8 [==============================] - 0s 48ms/step - loss: 1.3978 - acc: 0.0000e+00
    Epoch 2/50
    8/8 [==============================] - 0s 356us/step - loss: 1.3618 - acc: 0.1250
    Epoch 3/50
    8/8 [==============================] - 0s 201us/step - loss: 1.3313 - acc: 0.3750
    Epoch 4/50
    8/8 [==============================] - 0s 167us/step - loss: 1.2965 - acc: 0.7500
    Epoch 5/50
    8/8 [==============================] - 0s 139us/step - loss: 1.2643 - acc: 0.8750
    ........
    ........
    Epoch 45/50
    8/8 [==============================] - 0s 122us/step - loss: 0.3500 - acc: 1.0000
    Epoch 46/50
    8/8 [==============================] - 0s 140us/step - loss: 0.3376 - acc: 1.0000
    Epoch 47/50
    8/8 [==============================] - 0s 134us/step - loss: 0.3257 - acc: 1.0000
    Epoch 48/50
    8/8 [==============================] - 0s 137us/step - loss: 0.3143 - acc: 1.0000
    Epoch 49/50
    8/8 [==============================] - 0s 141us/step - loss: 0.3032 - acc: 1.0000
    Epoch 50/50
    8/8 [==============================] - 0s 177us/step - loss: 0.2925 - acc: 1.0000