sql database sqlite corpus linguistics

Looking up a word's sentences in a corpus of 15 million words


I have a corpus of 15 million words, which I'd like to store in a database. I'd then like to be able to find, for a given word, its context within the corpus. For example, for the word "friends" I might select the following, where I am also selecting the five words before and after each "friends":

... night i went to my FRIENDS house for a cup of tea ...
... what did you say my FRIENDS cat is sick and ...
... if you like my FRIENDS dad can pick you up ...

How might I best organise my database to select the contexts of a given word efficiently? I usually use SQLite when I need a database, but maybe something else is better in this case.


Solution

  • If you want to find a word in a corpus, then you need full-text search capabilities. SQLite does offer such capabilities as an extension (the FTS modules), which are described in the SQLite documentation.

    Full text search is going to return the document that matches a given query. You will first need to break up the corpus into separate documents. Usually, this is a very easy task -- the documents might be emails, or customer service records, or doctor's notes, or reports, or whatever. However, you do not describe what the documents are in your case.

    I am not at all familiar with the full-text extensions to SQLite. You might also consider other database systems, such as MySQL, which also offers full-text search support.
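As a rough illustration of the full-text approach, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the underlying SQLite build includes the FTS5 extension (common, but not guaranteed), that the corpus is already a list of lowercase, whitespace-separated words, and that fixed-size runs of words can stand in for "documents". The table name `chunks`, the functions `build_index` and `contexts`, and the `chunk_size` and `width` parameters are all hypothetical names chosen for this example.

```python
import sqlite3


def build_index(conn, words, chunk_size=50):
    """Store the corpus as fixed-size chunks of words in an FTS5 table.

    Each chunk plays the role of a "document" (an assumption of this sketch).
    """
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(body)")
    rows = (
        (" ".join(words[i:i + chunk_size]),)
        for i in range(0, len(words), chunk_size)
    )
    conn.executemany("INSERT INTO chunks(body) VALUES (?)", rows)
    conn.commit()


def contexts(conn, word, width=5):
    """Return each occurrence of `word` with up to `width` words on each side."""
    # Quote the term so FTS5 treats it as a literal string, not query syntax.
    query = '"' + word.replace('"', '""') + '"'
    results = []
    cursor = conn.execute(
        "SELECT body FROM chunks WHERE chunks MATCH ? ORDER BY rowid",
        (query,),
    )
    for (body,) in cursor:
        tokens = body.split()
        for i, tok in enumerate(tokens):
            if tok.lower() == word.lower():
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                results.append(" ".join(left + [tok.upper()] + right))
    return results


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = "last night i went to my friends house for a cup of tea"
    build_index(conn, sample.split())
    for line in contexts(conn, "friends"):
        print("...", line, "...")
```

The full-text index only narrows the search down to the chunks that contain the word; the surrounding words are then cut out in Python. One limitation of this sketch is that context is truncated at chunk boundaries, so a real corpus would need overlapping chunks or natural document units such as sentences or paragraphs.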