algorithm, data-mining, text-processing, similarity

How to get the most similar texts from a large number of pages?


I want to get the x texts most similar to one given text from a large collection of texts.

It may be better to convert the pages to plain text first.

The given text should not be compared against every other text, because that is too slow.


Solution

  • I don't know what you mean by "similar", but you could load your texts into a search system such as Lucene and pose your "one text" to it as a query. Lucene indexes the texts ahead of time, so at query time it can quickly find the most similar ones (by its own measure), as you asked.
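
To make the idea of pre-indexing concrete, here is a rough Python sketch of a tiny inverted index with TF-IDF cosine scoring. It is not Lucene and does not reproduce Lucene's scoring; the class name `TinyIndex`, the `tokenize` helper, and the sample texts are all made up for illustration. The point it tries to show is that the query only touches texts that share at least one term with it, rather than being compared to every text.

```python
import heapq
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index with TF-IDF cosine scoring (hypothetical, not Lucene)."""

    def __init__(self, texts):
        self.n_docs = len(texts)
        counts = [Counter(tokenize(t)) for t in texts]
        # Document frequency: in how many texts each term occurs.
        df = Counter(term for c in counts for term in c)
        self.idf = {t: math.log(1 + self.n_docs / n) for t, n in df.items()}
        self.postings = defaultdict(list)  # term -> [(doc_id, weight), ...]
        self.norms = []                    # vector length of each text
        for doc_id, c in enumerate(counts):
            weights = {t: tf * self.idf[t] for t, tf in c.items()}
            self.norms.append(math.sqrt(sum(w * w for w in weights.values())))
            for t, w in weights.items():
                self.postings[t].append((doc_id, w))

    def most_similar(self, query_text, x=3):
        """Return up to x (doc_id, cosine_score) pairs, best match first."""
        q_counts = Counter(tokenize(query_text))
        # Terms never seen at index time cannot match anything, so drop them.
        q_weights = {t: tf * self.idf[t]
                     for t, tf in q_counts.items() if t in self.idf}
        q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
        if q_norm == 0:
            return []
        scores = defaultdict(float)
        # Only walk the posting lists of the query's own terms.
        for t, qw in q_weights.items():
            for doc_id, dw in self.postings[t]:
                scores[doc_id] += qw * dw
        ranked = ((s / (q_norm * self.norms[d]), d) for d, s in scores.items())
        return [(d, round(s, 3)) for s, d in heapq.nlargest(x, ranked)]


if __name__ == "__main__":
    texts = [
        "the cat sat on the mat",
        "dogs chase cats in the park",
        "stock markets fell sharply today",
        "the cat chased the mouse",
    ]
    index = TinyIndex(texts)
    for doc_id, score in index.most_similar("a cat on a mat", x=2):
        print(doc_id, score, texts[doc_id])
```

With these sample texts, the first text ranks highest for the query "a cat on a mat", and the text about stock markets is never scored at all because it shares no term with the query. A real search system adds stemming, stop-word handling, and a tuned scoring function on top of this basic structure, so treat the sketch only as an illustration of why an index avoids the all-pairs comparison.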