I have a word2vec encoding for a set of queries and documents. Some queries and documents are relevant to one another and some are not. I am trying to train a logistic regression model to recognise whether a document is relevant to a query.
Currently I have a pandas data frame named training_data that looks like this (simplified version):
query vector | doc vector | relevant |
---|---|---|
[1,6,4] | [3,6,2] | 1 |
[5,2,1] | [1,1,1] | 0 |
All the vectors in the query vector and doc vector columns are the same length. I also have a separate test set with the same layout.
My question is: what is the correct syntax for feeding this data into sklearn's LogisticRegression model? I am able to do it if the data is a single number, and I am able to make one column into a list and feed that in, but how can I do that for 2 sets of features?
For example, if I ignore the query vector column for now I can just do:
from sklearn.linear_model import LogisticRegression
y = training_data["relevant"]
X = list(training_data["doc vector"])
clf = LogisticRegression()
clf.fit(X, y)
and that works, but how can I add the query vector column into this? If I try and go straight from the data frame into the model without converting to a list I get "ValueError: setting an array element with a sequence.".
I've tried a few combinations of things including making a 2d array representing the 2 columns but that gave me the same error as above. Help!
If you have one set of N numeric features, then another set of M numeric features, you'd generally concatenate them together into one combined "flat" set of (N+M) features to make them appropriate for a Scikit-Learn model that expects a single flat array of features as its X inputs.
There are many ways to do that, manually or with the assistance of other library code.
The simplest approach from your setup would probably be a bit of idiomatic Python like:
X = [list1 + list2
     for list1, list2 in zip(training_data["query vector"],
                             training_data["doc vector"])]
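One caveat with the list comprehension above: it relies on each cell holding a plain Python list, where `+` concatenates. If the cells hold NumPy arrays (as word2vec vectors often do), `+` would add them element-wise instead. The following is a minimal sketch of a version that works for either case, using a made-up toy dataframe in place of your real training_data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for training_data; the values are made up for illustration.
training_data = pd.DataFrame({
    "query vector": [[1, 6, 4], [5, 2, 1], [2, 2, 3], [0, 1, 5]],
    "doc vector": [[3, 6, 2], [1, 1, 1], [2, 3, 3], [4, 0, 1]],
    "relevant": [1, 0, 1, 0],
})

# Turn each column of per-row vectors into a 2-D array.
queries = np.array(training_data["query vector"].tolist())  # shape (n_rows, N)
docs = np.array(training_data["doc vector"].tolist())       # shape (n_rows, M)

# Join the two arrays side by side: one flat row of N + M features per example.
X = np.hstack([queries, docs])
y = training_data["relevant"].to_numpy()

clf = LogisticRegression()
clf.fit(X, y)
```

The same two stacking steps would be applied to the test set before calling clf.predict on it.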
If you were building a more extensive Scikit-Learn pipeline, its FeatureUnion helper class is a utility for combining the results of two or more different feature-extraction methods. A demo of its use is in the Scikit-Learn example "Concatenating multiple feature extraction methods".
(Note that in the case of your dataframe's rows, each of the two "feature extractions" could just be pulling the existing arrays from their two different cells. So any 'transformers' you might use in that style of approach might do little more than take a rough, not-yet-correctly-structured example, such as a Pandas row or a tuple of your 2 arrays, and index into it to access the chosen input elements.)
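As a rough sketch of that FeatureUnion style, each "extractor" below is just a FunctionTransformer that pulls one column out of the dataframe and stacks it into a 2-D array. The helper name column_to_matrix and the toy data are hypothetical, introduced only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy stand-in for training_data; the values are made up for illustration.
training_data = pd.DataFrame({
    "query vector": [[1, 6, 4], [5, 2, 1], [2, 2, 3], [0, 1, 5]],
    "doc vector": [[3, 6, 2], [1, 1, 1], [2, 3, 3], [4, 0, 1]],
    "relevant": [1, 0, 1, 0],
})

def column_to_matrix(column_name):
    """Hypothetical helper: a transformer that stacks one column of vectors into a 2-D array."""
    return FunctionTransformer(
        lambda frame: np.array(frame[column_name].tolist())
    )

# FeatureUnion runs both extractors and concatenates their outputs column-wise.
union = FeatureUnion([
    ("query", column_to_matrix("query vector")),
    ("doc", column_to_matrix("doc vector")),
])

# The pipeline takes the raw dataframe and feeds the combined features to the model.
model = make_pipeline(union, LogisticRegression())
model.fit(training_data, training_data["relevant"])

combined = model[0].transform(training_data)  # shape (n_rows, N + M)
predictions = model.predict(training_data)
```

For only two precomputed columns this is heavier than plain concatenation; it mainly pays off when the feature steps are part of a larger pipeline that you want to fit and apply to the test set as one object.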