I have a dataset which has 3 different columns of relevant text information which I want to convert into doc2vec vectors and subsequently classify using a neural net. My question is how do I convert these three columns into vectors and input into a neural net?
How do I input the concatenated vectors into a neural network?
One way is to get a doc2vec vector for each of the three documents in a fixed order and concatenate them. Then feed the resulting vector to your neural network.
Another way is to create a column in which each row combines the text of all three documents, and get one vector representation for them together. See some example code below.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

# infer_vector expects a single list of word tokens, so join the three
# documents into one string and tokenize it
combined = ('this is the first sentence '
            'here is another sentence '
            'this represents the third sentence')
model.infer_vector(combined.split()).tolist()
Once this is done you can initialize your model and train it.
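Since the question asks about a neural network specifically, one option is sklearn's MLPClassifier, a simple feed-forward network. A minimal sketch with hypothetical 5-dimensional doc2vec-style vectors (the hidden layer size and other parameters are illustrative choices, not requirements):

```python
from sklearn.neural_network import MLPClassifier

# hypothetical 5-dimensional doc2vec vectors and their labels
X = [[0.1, 0.2, 0.3, 0.1, 0.0],
     [0.9, 0.8, 0.7, 0.9, 1.0],
     [0.2, 0.1, 0.2, 0.0, 0.1],
     [0.8, 0.9, 0.9, 1.0, 0.8]]
y = ['class1', 'class2', 'class1', 'class2']

# a small feed-forward neural network with one hidden layer
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
nn.fit(X, y)
pred = nn.predict([[0.15, 0.2, 0.25, 0.05, 0.05]])
```

Any deep-learning framework would work equally well here; the only requirement is that the input dimension matches the (possibly concatenated) doc2vec vector size.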
To fit an sklearn classifier, for example an SVM, check out the code snippets below.
from sklearn import svm
import pandas as pd

clf = svm.SVC(gamma=0.001, C=100.0)
d = pd.DataFrame({'vectors': [[1, 2, 3], [3, 6, 5], [9, 2, 4], [1, 2, 7]],
                  'targets': ['class1', 'class1', 'class2', 'class2']})
d
>>>
vectors targets
0 [1, 2, 3] class1
1 [3, 6, 5] class1
2 [9, 2, 4] class2
3 [1, 2, 7] class2
You can fit an sklearn classifier on the vectors as follows.
clf.fit(X=d.vectors.values.tolist(), y=d.targets)
>>>
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
You can then use this classifier to predict values.
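For example, given the fitted classifier from above, prediction takes new vectors in the same list-of-lists format (the input vector here is hypothetical):

```python
from sklearn import svm
import pandas as pd

clf = svm.SVC(gamma=0.001, C=100.0)
d = pd.DataFrame({'vectors': [[1, 2, 3], [3, 6, 5], [9, 2, 4], [1, 2, 7]],
                  'targets': ['class1', 'class1', 'class2', 'class2']})
clf.fit(X=d.vectors.values.tolist(), y=d.targets)

# predict the class of an unseen vector
pred = clf.predict([[2, 3, 4]])
```

In practice you would replace the toy vectors with the doc2vec vectors inferred earlier, keeping the same dimensionality between training and prediction.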