Tags: python, nlp, gensim, word2vec

Gensim word2vec and large amount of texts


I need to put the texts contained in a column of a MySQL database (about 3 million rows) into a list of lists of tokens. These texts (which are tweets, so they are generally short) must be preprocessed before being added to the list (stop words, hashtags, tags etc. must be removed). The list will later be passed to Word2Vec. This is the relevant part of the code:

import mysql.connector
import re
from gensim.models import Word2Vec
import preprocessor as p
p.set_options(
    p.OPT.URL,
    p.OPT.MENTION,
    p.OPT.HASHTAG,
    p.OPT.NUMBER
)

conn = mysql.connector.connect(...)
cursor = conn.cursor()
query = "SELECT text FROM tweet"
cursor.execute(query)
table = cursor.fetchall()

stopwords = open('stopwords.txt', encoding='utf-8').read().split('\n')
sentences = []
for row in table:
    sentences = sentences + [[w for w in re.sub(r'[^\w\s-]', ' ', p.clean(row[0])).lower().split() if w not in stopwords and len(w) > 2]]

cursor.close()
conn.close()

model = Word2Vec(sentences)
...

Obviously this takes a lot of time, and I know my method is probably inefficient. Can anyone recommend a better one? I know this isn't a question directly about gensim and Word2Vec, but perhaps those who use them have already faced the problem of working with large amounts of text.


Solution

  • You haven't mentioned how long your code takes to run, but some potential sources of slowdown in your current technique might include:

    • the overhead of regex-based preprocessing, especially if a large number of independent regexes are each applied separately to the same texts
    • the inefficiency of growing a Python list one item at a time - and note that sentences = sentences + [...] is even worse than .append(), because it rebuilds the entire list on every iteration
    • virtual-memory swapping, if the size of your data exceeds physical RAM

    You can check the swapping issue by watching memory usage with a platform-specific tool (like top on Linux) while the code runs. If swapping is a contributor, using a machine with more RAM, or making other code changes to reduce RAM usage (see below), will help.

    Your full preprocessing code isn't shown, but a common approach is a long chain of independent steps, each of which applies one or more regular expressions and then returns a plain modified string (for the next step).

    As appealingly simple & pluggable as that is, it often becomes a source of avoidable slowness when preprocessing large amounts of text. For example, each regex/step may itself have to re-detect token boundaries, or split and then re-concatenate the string. Or the regexes might use complex match patterns, or techniques (like backtracking) that are expensive on worst-case inputs.

    Often this sort of preprocessing can be greatly improved by one or more of:

    • coalescing multiple regexes into a single step, so a string faces one front-to-back pass rather than N
    • breaking the text into short tokens early, then leaving it as a list-of-tokens for later steps - thus never redundantly splitting/joining, and letting later token-oriented steps work on smaller strings with perhaps even simpler (non-regex) string tests (see the sketch after this list)
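
    For example, here is a minimal sketch of that shape of pipeline - the combined pattern below is only an assumption standing in for whatever rules your preprocessor applies, not a drop-in replacement for the tweet-preprocessor library:

        import re

        # Assumed combined pattern: strip URLs, @mentions, #hashtags & numbers
        # in one front-to-back pass, instead of applying one regex per rule.
        NOISE = re.compile(r'https?://\S+|@\w+|#\w+|\b\d+\b')
        NON_WORD = re.compile(r'[^\w\s-]')

        def preprocess(text, stopwords):
            """Clean one tweet and return it as a list of tokens."""
            text = NOISE.sub(' ', text)             # single pass over the raw text
            text = NON_WORD.sub(' ', text).lower()  # same character filter as the question
            # Tokenize early: the remaining checks are cheap membership/length
            # tests on short tokens, not more regexes over the full string.
            return [w for w in text.split() if w not in stopwords and len(w) > 2]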

    Also, even if the preprocessing remains somewhat time-consuming, a big process improvement is usually to make sure you only repeat it when the data changes. That is, if you're going to try a bunch of different downstream steps, like different Word2Vec parameters, make sure you're not re-running the expensive preprocessing every time. Do it once, write the results aside to a file, then reuse that file until it needs to be regenerated (because the data or preprocessing rules have changed).

    Finally, if the grow-by-one pattern is contributing to your slowness, you could pre-allocate your sentences list (sentences = [None] * desired_length), then assign into each slot in your loop rather than append (sentences[row_num] = preprocessed_tokens). But that might not be a major factor, and in fact the suggestion above about reusing a results file is a better way to minimize list operations and RAM usage, as well as to enable reuse across alternate runs.

    That is, open a new working file before your loop. Append each preprocessed text – with spaces between the tokens, and a newline at the end – as one new line to this file. Then, have your Word2Vec step work directly from that file. (In Gensim, you can do this by wrapping the file with a LineSentence utility object, which reads a file of that format back as a re-iterable sequence, with each item being a list-of-tokens, or by using the corpus_file parameter to feed the filename directly to Word2Vec.)
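
    As a rough sketch of that file-based flow (tweets_tokenized.txt is just a placeholder name, and preprocess() stands in for whatever cleaning you settle on):

        from gensim.models import Word2Vec
        from gensim.models.word2vec import LineSentence

        # One-time preprocessing pass: one tweet per line, tokens separated by spaces.
        with open('tweets_tokenized.txt', 'w', encoding='utf-8') as out:
            for row in table:
                out.write(' '.join(preprocess(row[0], stopwords)) + '\n')

        # Later runs can skip straight to here and reuse the file.
        # Option A: wrap the file so it's lazily re-read on every training pass.
        model = Word2Vec(LineSentence('tweets_tokenized.txt'))

        # Option B: pass the path directly via corpus_file.
        model = Word2Vec(corpus_file='tweets_tokenized.txt')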

    From that list of possible tactics, I'd try:

    • First, time your existing preprocessing code (the loop that creates your sentences list) - see the timing sketch after this list.
    • Then, eliminate all fancy preprocessing, doing nothing more complicated than .split(), and re-time. If there's a big change, then yes, the preprocessing is the major slowdown, so concentrate on improving that.
    • If even that minimal preprocessing still seems slower than desired, then the RAM/concatenation issues may be the concern, so try writing to an interim file.
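
    The timing comparison can be as simple as the following (preprocess() here is the hypothetical helper from the earlier sketch - swap in your real code, or just row[0].split() for the minimal baseline):

        import time

        start = time.perf_counter()
        sentences = []
        for row in table:
            sentences.append(preprocess(row[0], stopwords))  # or: row[0].split()
        print(f"preprocessing took {time.perf_counter() - start:.1f} s")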

    Separately: it's not strictly necessary to remove stop-words for word2vec training - much published work doesn't bother with that step, and the algorithm already includes a sample parameter that skips many of the most over-represented words during training as less interesting. Similarly, 2- and even 1-character tokens may still be interesting, especially in the domain of tweets, so you might not want to discard them unconditionally. (For example, lone emoji can be significant 'words'.)
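
    For instance, instead of a stop-word list you could lean on the built-in downsampling - sample defaults to 1e-3, and a smaller value discards frequent words more aggressively (the 1e-5 and min_count=5 below are just illustrative choices):

        # Train without stop-word removal, letting 'sample' thin out very frequent words.
        model = Word2Vec(corpus_file='tweets_tokenized.txt', sample=1e-5, min_count=5)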