How do I add extra features to the array created by doc2Vec model.infer_vector?


I am new to NLP and doc2Vec.

  1. I used doc2vec to generate an array for each document.
  2. I want to use that array plus extra features (e.g. Income) as inputs to another model, such as Logistic Regression. How do I combine the doc array and the extra features?
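
import multiprocessing

from gensim.models import Doc2Vec

# Assumption: `cores` was not defined in the original snippet.
cores = multiprocessing.cpu_count()

def get_vectors(model, tagged_docs):
    # tagged_docs is a pandas Series of TaggedDocument objects
    sents = tagged_docs.values
    # Pair each document's first tag with its inferred vector
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words)) for doc in sents])
    return targets, regressors

model = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model.build_vocab(train_tagged.values)
model.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=1)

y_train_doc, X_train_doc = get_vectors(model, train_tagged)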

print(X_train_doc)
(array([ 0.168, -0.36 , -0.13], dtype=float32), 
 array([ 0.185,  0.17, 0.04], dtype=float32),....)

X_train_doc is a tuple of arrays. For each array, should I put each element into its own column of a DataFrame, like below?

doc | Income | doc_feature1 | doc_feature2|  doc_feature3 |
  1 | 10000  |  0.168       | -0.36       | -0.13         |
  2 |  500   |  0.185       | 0.17        |  0.04         |


Solution

  • That depends on exactly which downstream libraries/models you're using. In general, you wouldn't want to re-introduce the overhead of pandas DataFrames; downstream models are more likely to use raw numpy arrays.

    If using scikit-learn pipelines, the FeatureUnion class may be of use:

    https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
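
A rough sketch of the FeatureUnion route, under some loud assumptions: the input is a toy array whose first three columns stand in for doc-vector dimensions and whose last column stands in for Income, and the helper functions select_doc_columns and select_income_column are hypothetical names made up for this illustration. FeatureUnion runs each branch on the same input and places the outputs side by side:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data (made up): columns 0-2 play the role of doc-vector dimensions,
# column 3 plays the role of Income.
X = np.array([
    [0.168, -0.36, -0.13, 10000.0],
    [0.185, 0.17, 0.04, 500.0],
    [-0.21, 0.05, 0.30, 7200.0],
    [0.02, -0.11, 0.25, 300.0],
])
y = np.array([1, 0, 1, 0])

def select_doc_columns(X):
    # Hypothetical helper: pass the doc-vector columns through unchanged.
    return X[:, :3]

def select_income_column(X):
    # Hypothetical helper: keep Income as a 2-D column.
    return X[:, 3:]

features = FeatureUnion([
    ("doc", FunctionTransformer(select_doc_columns)),
    ("income", Pipeline([
        ("select", FunctionTransformer(select_income_column)),
        ("scale", StandardScaler()),  # Income is on a very different scale
    ])),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(X, y)

# The fitted union returns the branches concatenated column-wise.
combined = model.named_steps["features"].transform(X)
print(combined.shape)
```

In a real pipeline the "doc" branch would more likely be a custom transformer that calls infer_vector on raw text; that part is left out here to keep the sketch free of a trained Doc2Vec model.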

    If instead you've got, say…

    • a numpy array of 10000 rows with 300 dimensions each for your 10000 elements' doc-vectors, then
    • 10000 rows of 5 dimensions each for your 10000 elements' other features, in the same order

    …then you may want to concatenate those horizontally, into 10000 rows of 305 dimensions each, using something like the numpy hstack function (one of a variety of options):

    https://numpy.org/doc/stable/reference/generated/numpy.hstack.html#numpy.hstack
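
A minimal sketch of that concatenation, at toy size rather than 10000 x 300. The values and the single Income column are made up for illustration; X_train_doc mimics the tuple of arrays that get_vectors returns in the question:

```python
import numpy as np

# Stand-in for X_train_doc from the question: a tuple of 1-D float32 arrays,
# one per document (illustrative values, 3 dimensions instead of 300).
X_train_doc = (
    np.array([0.168, -0.36, -0.13], dtype=np.float32),
    np.array([0.185, 0.17, 0.04], dtype=np.float32),
    np.array([-0.21, 0.05, 0.30], dtype=np.float32),
    np.array([0.02, -0.11, 0.25], dtype=np.float32),
)

# Hypothetical extra feature: one value per document, in the same order.
income = np.array([10000.0, 500.0, 7200.0, 300.0])

doc_matrix = np.vstack(X_train_doc)       # tuple of vectors -> shape (4, 3)
extra = income.reshape(-1, 1)             # make Income a column, shape (4, 1)
X_train = np.hstack([doc_matrix, extra])  # side by side -> shape (4, 4)

print(X_train.shape)
```

The resulting X_train can be handed to an estimator such as scikit-learn's LogisticRegression. Since Income is on a very different scale from the doc-vector dimensions, scaling it first is probably worthwhile.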