I am new to NLP and Doc2Vec. Here is my code:
def get_vectors(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors
model = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model.build_vocab(train_tagged.values)
model.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=1)
y_train_doc, X_train_doc = get_vectors(model, train_tagged)
print(X_train_doc)
(array([ 0.168, -0.36 , -0.13], dtype=float32),
array([ 0.185, 0.17, 0.04], dtype=float32),....)
X_train_doc is a tuple of arrays. So, for each array, should I put each element into a separate column of a DataFrame, like below?
doc | Income | doc_feature1 | doc_feature2 | doc_feature3 |
1   | 10000  | 0.168        | -0.36        | -0.13        |
2   | 500    | 0.185        | 0.17         | 0.04         |
That's going to depend on exactly what downstream libraries/models you're using. In general, you wouldn't want to re-introduce the overhead of Pandas DataFrames; downstream models are more likely to work with raw numpy arrays.
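For example, the tuple of per-document vectors returned by get_vectors() can be stacked into a single 2-D numpy array before it's fed to a downstream model (a rough sketch, reusing the variable names from the question):

import numpy as np

y_train_doc, X_train_doc = get_vectors(model, train_tagged)

# stack the tuple of 1-D inferred vectors into one (n_docs, vector_size) array
X_train_doc = np.vstack(X_train_doc)
y_train_doc = np.asarray(y_train_doc)

print(X_train_doc.shape)  # e.g. (n_docs, 300) with vector_size=300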
If you're using scikit-learn pipelines, the FeatureUnion class may be of use: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
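For instance, one way to wire that together is with small custom transformers. This is only a sketch; Doc2VecVectors and OtherFeatures are hypothetical wrappers written for illustration, not classes provided by gensim or scikit-learn:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np

class Doc2VecVectors(BaseEstimator, TransformerMixin):
    # hypothetical wrapper: infer a Doc2Vec vector for each tokenized document
    def __init__(self, d2v_model):
        self.d2v_model = d2v_model
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.vstack([self.d2v_model.infer_vector(words) for words in X])

class OtherFeatures(BaseEstimator, TransformerMixin):
    # hypothetical wrapper: return pre-computed numeric features aligned with the docs
    def __init__(self, features):
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.asarray(self.features)

# X passed to the pipeline would be the list of tokenized documents;
# income_array an aligned (n_docs, 1) array of the extra numeric feature
union = FeatureUnion([
    ("doc2vec", Doc2VecVectors(model)),
    ("income", OtherFeatures(income_array)),
])
clf = Pipeline([("features", union), ("logreg", LogisticRegression())])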
If instead you've got, say, a numpy array of 10000 rows with 5 other features each, plus a numpy array of 10000 rows with 300 dimensions each for your 10000 elements' doc-vectors, then you may want to concatenate those horizontally into 10000 rows of 305 dimensions each, using something like the numpy hstack function (one of a variety of options): https://numpy.org/doc/stable/reference/generated/numpy.hstack.html#numpy.hstack
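For instance, a rough sketch of that horizontal concatenation, with made-up array contents matching the shapes described above:

import numpy as np

rng = np.random.default_rng(0)
other_features = rng.random((10000, 5))   # e.g. Income plus four other numeric columns
doc_vectors = rng.random((10000, 300))    # one inferred Doc2Vec vector per document

X = np.hstack([other_features, doc_vectors])
print(X.shape)  # (10000, 305)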