Tags: nlp, word2vec, word-embedding, doc2vec, lightgbm

Can I interpret doc2vec components?


I am solving a binary text classification problem on corporate filings. Using 100-dimensional Doc2Vec embeddings with LightGBM is producing great results. However, for this project it would be very valuable to approximate a thematic meaning for at least one of the components. Ideally, this would be a feature that LightGBM ranks with high importance, explained anecdotally with a few examples.

Has anyone attempted this, or should interpretation be off the table for a high-dimensional model with this level of complexity?
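
For reference, a minimal sketch of the setup described, assuming gensim's Doc2Vec and LightGBM's scikit-learn API; `docs` and `labels` are hypothetical stand-ins for the filings and their binary targets:

```python
# Hypothetical setup: `docs` is a list of filing texts, `labels` their 0/1 targets.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from lightgbm import LGBMClassifier
import numpy as np

tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=100, epochs=20)

# One 100-dimensional vector per document, used as LightGBM features.
X = np.vstack([d2v.dv[i] for i in range(len(docs))])
clf = LGBMClassifier().fit(X, labels)

# Rank the embedding dimensions by LightGBM's feature importance.
top_dims = np.argsort(clf.feature_importances_)[::-1][:5]
print(top_dims)
```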


Solution

  • The individual dimensions of a Doc2Vec representation should not be treated as independent, interpretable features. They're only useful in concert with each other, and the exact directions aligned with individual coordinate axes may not be meaningful in any human-describable sense.

    However, neighborhoods of the space may loosely correspond to describable themes, and certain directions (not necessarily parallel to any coordinate axis) may loosely track semantic themes; the first sketch after this list probes a single axis anecdotally.

    To characterize those themes, you might find the centroids of groups of related documents, or of discovered clusters, and compare the relative distances and directions between those centroids; see the clustering sketch below.
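
One anecdotal probe, continuing the hypothetical `X`, `docs`, and `top_dims` names from the question's sketch: sort documents along one high-importance coordinate axis and read the extremes. If no theme is apparent, that supports treating the axis as uninterpretable on its own.

```python
# Inspect the documents at the extremes of one high-importance dimension.
dim = top_dims[0]
order = X[:, dim].argsort()

print("Lowest on dimension", dim)
for i in order[:3]:
    print(docs[i][:120])

print("Highest on dimension", dim)
for i in order[-3:]:
    print(docs[i][:120])
```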
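And a minimal sketch of the centroid comparison, assuming scikit-learn's KMeans over the same hypothetical `X` and `docs`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Discover clusters in the document-vector space; k=8 is an arbitrary choice.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Relative directions between centroids: similar centroids may share a
# describable theme; dissimilar ones mark contrasting regions of the space.
print(cosine_similarity(centroids).round(2))

# Characterize each cluster anecdotally by the documents nearest its centroid.
for c in range(km.n_clusters):
    sims = cosine_similarity(X, centroids[c:c + 1]).ravel()
    for i in sims.argsort()[-3:][::-1]:
        print(c, docs[i][:120])
```

Cosine similarity is used for the centroid comparisons since, in these embedding spaces, direction tends to matter more than vector magnitude.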