machine-learning, nlp, text-analysis

How to combine TFIDF features with other features


I have a classic NLP problem: I have to classify news articles as fake or real.

I have created two sets of features:

A) Bigram Term Frequency-Inverse Document Frequency (TF-IDF) features

B) Approximately 20 features for each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as the subjectivity of the text, polarity, number of stopwords, number of verbs, number of subjects, grammatical relations, etc.

What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone.


Solution

  • I'm not sure whether you're asking how to combine the two objects in code or what to do with them afterwards in theory, so I will try to answer both.

    Technically, your TF-IDF output is just a matrix where the rows are records and the columns are features. To combine the two sets, you can append your new features as columns at the end of that matrix. If you built it with sklearn, the matrix is probably a SciPy sparse matrix, so you will have to make sure your new features are a sparse matrix as well (or convert the TF-IDF matrix to dense).
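The column-appending step can be sketched as follows. This is only a minimal illustration, assuming the TF-IDF matrix comes from sklearn's `TfidfVectorizer`; the documents and the three extra feature values (`X_extra`) are made up for the example and stand in for the pattern.en features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the news articles (hypothetical data).
docs = [
    "the president signed the new bill today",
    "aliens secretly control the stock market",
    "the new bill was signed by the president",
]

# Bigram TF-IDF: one row per document, one column per bigram.
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_tfidf = vectorizer.fit_transform(docs)  # SciPy sparse matrix

# Hypothetical dense features, one row per document
# (e.g. subjectivity, polarity, number of stopwords).
X_extra = np.array([
    [0.10, 0.05, 3.0],
    [0.90, -0.40, 1.0],
    [0.15, 0.00, 4.0],
])

# Convert the dense block to sparse and append it as extra columns.
X_combined = hstack([X_tfidf, csr_matrix(X_extra)], format="csr")
print(X_combined.shape)
```

The rows of both blocks must be in the same document order, since `hstack` only lines them up by position.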

    That gives you your training data. What to do with it next is a little more tricky. Your features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that you will have a lot of 0s). Strictly speaking, TF-IDF values are continuous weights rather than binary, but each column is zero for most documents, so it behaves much like a presence indicator. Your other data, by contrast, is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering, I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give improved results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). I usually found better results when I could at least bin the dense variables into binary (or into categorical and then one-hot encoded into binary) so that they didn't dominate.
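The binning idea above could look roughly like this. It is a sketch under assumptions, not the exact setup used for the classifiers mentioned: sklearn's `KBinsDiscretizer` is one possible way to bin and one-hot encode the dense variables, a random forest stands in for "tree ensembles", and the documents, labels and dense values are invented for illustration.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import KBinsDiscretizer

# Made-up documents and labels (0 = real, 1 = fake).
docs = [
    "the president signed the new bill today",
    "aliens secretly control the stock market",
    "the new bill was signed by the president",
    "miracle cure doctors do not want you to know",
    "the central bank raised interest rates today",
    "secret society controls the weather with lasers",
]
labels = np.array([0, 1, 0, 1, 0, 1])

# Hypothetical dense features: subjectivity score, number of verbs.
X_dense = np.array([
    [0.10, 3.0],
    [0.90, 1.0],
    [0.20, 4.0],
    [0.80, 0.0],
    [0.15, 5.0],
    [0.95, 2.0],
])

# Sparse bigram TF-IDF block.
X_tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

# Bin each dense feature into 2 equal-width bins and one-hot encode the
# bin index, so every dense feature becomes a few 0/1 columns.
binner = KBinsDiscretizer(n_bins=2, encode="onehot", strategy="uniform")
X_binned = binner.fit_transform(X_dense)  # sparse matrix of 0s and 1s

# Stack both blocks column-wise and fit a tree ensemble on the result.
X_combined = hstack([X_tfidf, X_binned], format="csr")
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_combined, labels)
print(clf.predict(X_combined))
```

The number of bins and the binning strategy are tuning choices; on real data they would need to be picked by cross-validation, with the binner fitted on the training split only.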