Tags: nlp, word2vec, lda, summarization

Combine word embeddings with the topic-word distribution from LDA for text summarization


I'm a newbie in NLP, and I was wondering whether it is a good idea to summarize a document that has already been classified into a certain topic through methods such as LDA, by combining the word embeddings retrieved from Word2Vec with the topic-word distribution that has already been generated, to come up with a sentence-scoring algorithm. Does this sound like a good approach for creating a summary of a document?
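
To make it concrete, here is a rough sketch of the kind of sentence-scoring function I have in mind (topic_word_probs, w2v, and topic_centroid are placeholder names for the fitted LDA topic's word distribution, a word-vector lookup, and e.g. the embedding average of the topic's top words; none of this is from any particular library):

    import numpy as np

    def score_sentence(tokens, topic_word_probs, w2v, topic_centroid):
        # Hypothetical score: each token's LDA topic-word probability,
        # weighted by the cosine similarity of its embedding to a topic
        # centroid vector; the sentence score is the mean over tokens.
        weighted = []
        for tok in tokens:
            if tok in topic_word_probs and tok in w2v:
                vec = np.asarray(w2v[tok])
                cos = vec @ topic_centroid / (
                    np.linalg.norm(vec) * np.linalg.norm(topic_centroid))
                weighted.append(topic_word_probs[tok] * cos)
        return float(np.mean(weighted)) if weighted else 0.0

The highest-scoring sentences would then be picked for the summary.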


Solution

  • I would like to suggest this post.

    Instead of using the Skip-Thought encoder in step 4, you could use a pre-trained Word2Vec model from Google or Facebook's FastText vectors (check the FastText documentation to see how to load the second model or to choose another language).
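
    Both kinds of vectors can be loaded with gensim. A minimal sketch, assuming you have downloaded the Google News Word2Vec binary and a FastText .bin for your language (the file names below are placeholders for whichever files you fetch):

        from gensim.models import KeyedVectors
        from gensim.models.fasttext import load_facebook_vectors

        # Google's pre-trained Word2Vec vectors (word2vec binary format).
        w2v = KeyedVectors.load_word2vec_format(
            "GoogleNews-vectors-negative300.bin", binary=True)

        # Facebook's FastText vectors; pick the .bin file for your language.
        ft = load_facebook_vectors("cc.en.300.bin")

        print(w2v["summary"].shape)  # (300,)
        print(ft["summary"].shape)   # (300,); FastText also handles OOV words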

    In general, you will have the following steps (a rough end-to-end sketch follows the list):

    1. Text cleaning (delete numbers, but keep punctuation for now).
    2. Language detection (so you can remove the right stopwords and pick the matching Word2Vec model).
    3. Sentence tokenization (after this step you can drop the punctuation).
    4. Sentence encoding with the chosen Word2Vec model (e.g. average the token vectors of each sentence to get one vector per sentence).
    5. Clustering the sentence vectors with k-means (you must specify the number of clusters; it equals the number of sentences in the final summary).
    6. Building the summary (each summary sentence is the sentence nearest the centre of one cluster; see the original post for more details and code samples).
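
    To tie the steps together, here is a rough end-to-end sketch, assuming a gensim KeyedVectors model and NLTK's tokenizers (language detection and stopword removal are omitted for brevity, and all helper names are illustrative):

        import re
        import numpy as np
        from nltk.tokenize import sent_tokenize, word_tokenize
        from sklearn.cluster import KMeans

        def summarize(text, w2v, n_sentences=3):
            # Steps 1 and 3: clean the text and split it into sentences.
            text = re.sub(r"\d+", "", text)  # drop numbers, keep punctuation
            sentences = sent_tokenize(text)
            tokenized = [word_tokenize(s.lower()) for s in sentences]

            # Step 4: average the Word2Vec vectors of each sentence's tokens.
            dim = w2v.vector_size
            embeddings = np.array([
                np.mean([w2v[t] for t in toks if t in w2v] or [np.zeros(dim)],
                        axis=0)
                for toks in tokenized
            ])

            # Step 5: cluster sentence vectors; one cluster per summary sentence.
            n_clusters = min(n_sentences, len(sentences))
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)

            # Step 6: take the sentence nearest each cluster centre,
            # then restore the original document order.
            picked = []
            for c in range(n_clusters):
                members = [i for i, lbl in enumerate(km.labels_) if lbl == c]
                centre = km.cluster_centers_[c]
                best = min(members,
                           key=lambda i: np.linalg.norm(embeddings[i] - centre))
                picked.append(best)
            return " ".join(sentences[i] for i in sorted(picked))

    Averaging token vectors is the simplest possible sentence encoder; the original post's Skip-Thought model (or any other sentence encoder) can be swapped in at step 4 without changing the rest of the pipeline.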

    I hope this helps. Good luck! :)