I am having a few issues with scaling a text matching program. I am using text2vec, which provides very good and fast results.
The main problem I am having is manipulating a large matrix which is returned by the text2vec::sim2() function.
First, some details of my hardware / OS setup: Windows 7 with 12 cores at about 3.5 GHz and 128 GB of memory. It's a pretty good machine.
Second, some basic details of what my R program is trying to achieve.
We have a database of 10 million unique canonical addresses, one for every house / business address. These reference addresses also have latitude and longitude information for each entry.
I am trying to match these reference addresses to customer addresses in our database. We have about 600,000 customer addresses. The quality of these customer addresses is not good. Not good at all! They are stored as a single string field with absolutely zero checks on input.
The technical strategy to match these addresses is quite simple. Create two document term matrices (DTM) of the customer addresses and reference addresses and use cosine similarity to find the reference address which is the most similar to a specific customer address. Some customer addresses are so poor that they will result in a very low cosine similarity, so for these addresses a "no match" would be assigned.
Despite being a pretty simple solution, the results obtained are very encouraging.
But I am having problems scaling things, and I am wondering if anyone has any suggestions.
There is a copy of my code below. It's pretty simple. Obviously, I cannot include real data, but it should give readers a clear idea of what I am trying to do.
SECTION A - Works very well even on the full 600,000 * 10 million input data set.
SECTION B - The text2vec::sim2() function causes RStudio to shut down when the vocabulary exceeds about 140,000 tokens (i.e. columns). To avoid this, I process the customer addresses in chunks of about 200.
SECTION C - This is the most expensive section. When processing addresses in chunks of 200, SECTION A and SECTION B take about 2 minutes. But SECTION C, using (what I would have thought to be) super quick functions, takes about 5 minutes to process a 10 million row * 200 column matrix.
Combined, SECTIONS A:C take about 7 minutes to process 200 addresses. As there are 600,000 addresses to process, this will take about 14 days.
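For reference, the back-of-envelope arithmetic behind that estimate can be sketched as follows. It only uses the rough figures quoted above; the variable names are made up for illustration.

```r
# Rough figures from the description above (illustrative variable names)
n_addresses       <- 600000   # customer addresses to match
chunk_size        <- 200      # addresses processed per chunk
minutes_per_chunk <- 7        # approximate time for SECTIONS A:C on one chunk

n_chunks      <- n_addresses / chunk_size        # number of chunks to process
total_minutes <- n_chunks * minutes_per_chunk    # total run time in minutes
total_days    <- total_minutes / (60 * 24)       # total run time in days

n_chunks
total_days
```

That is 3,000 chunks and a little over 14 days, assuming the per-chunk time stays constant.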
Are there any ideas to make this code run faster?
rm(list = ls())
library(text2vec)
library(dplyr)
# Create some test data
# example is 10 entries.
# but in reality we have 10 million addresses
vct_ref_address <- c("15 smith street beaconsfield 2506 NSW",
"107 orange grove linfield 2659 NSW",
"88 melon drive calton 3922 VIC",
"949 eyre street sunnybank 4053 QLD",
"12 black avenue kingston 2605 ACT",
"5 sweet lane 2004 wynyard NSW",
"32 mugga way 2688 manuka ACT",
"4 black swan avenue freemantle 5943 WA",
"832 big street narrabeet 2543 NSW",
"5 dust road 5040 NT")
# example is 4 entries
# but in reality, we have 1.5 million addresses
vct_test_address <- c("949 eyre street sunnybank 4053 QLD",
"1113 completely invalid suburb with no post code QLD",
"12 black road kingston 2605 ACT",
"949 eyre roaod sunnybank 4053 QLD" )
# ==========================
# SECTION A ===== prepare data
# A.1 create vocabulary
t2v_token <- text2vec::itoken(c(vct_test_address, vct_ref_address), progressbar = FALSE)
t2v_vocab <- text2vec::create_vocabulary(t2v_token)
t2v_vectorizer <- text2vec::vocab_vectorizer(t2v_vocab)
# A.2 create document term matrices dtm
t2v_dtm_test <- text2vec::create_dtm(itoken(vct_test_address, progressbar = FALSE), t2v_vectorizer)
t2v_dtm_reference <- text2vec::create_dtm(itoken(vct_ref_address, progressbar = FALSE), t2v_vectorizer)
# ===========================
# SECTION B ===== similarity matrix
mat_sim <- text2vec::sim2(t2v_dtm_reference, t2v_dtm_test, method = 'cosine', norm = 'l2')
# ===========================
# SECTION C ===== process matrix
vct_which_reference <- apply(mat_sim, 2, which.max)
vct_sim_score <- apply(mat_sim, 2, max)
# ============================
# SECTION D ===== apply results
# D.1 assemble results
df_results <- data.frame(
test_addr = vct_test_address,
matched_addr = vct_ref_address[vct_which_reference],
similarity = vct_sim_score )
# D.2 print results
df_results %>% arrange(desc(similarity))
The issue in step C is that mat_sim is sparse, and all the apply calls perform column/row subsetting, which is super slow (and converts sparse vectors to dense).
There could be several solutions:
If mat_sim is not very large, convert it to a dense matrix with as.matrix and then use apply.
Better, you can convert mat_sim to a sparse matrix in triplet format with as(mat_sim, "TsparseMatrix") and then use data.table to get the indices of the max elements. Here is an example:
library(text2vec)
library(Matrix)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))
mat_sim = sim2(dtm[1:100, ], dtm[101:5000, ])
mat_sim = as(mat_sim, "TsparseMatrix")
library(data.table)
# we add 1 because the @i and @j slots of sparse matrices in the Matrix package are 0-based
mat_sim_dt = data.table(row_index = mat_sim@i + 1L, col_index = mat_sim@j + 1L, value = mat_sim@x)
res = mat_sim_dt[,
{ k = which.max(value); list(max_sim = value[[k]], row_index = row_index[[k]]) },
keyby = col_index]
res
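For comparison, the first option (dense conversion followed by apply) might look like the minimal sketch below. The toy matrix and the names toy_sim, best_row and best_score are made up for illustration; in the real case the input would be one chunk of the sim2() output.

```r
library(Matrix)

# Hypothetical toy similarity matrix: 4 reference rows x 3 test columns
toy_sim = sparseMatrix(
  i = c(1, 3, 2, 4, 4),
  j = c(1, 1, 2, 2, 3),
  x = c(0.9, 0.2, 0.5, 0.7, 0.3),
  dims = c(4, 3)
)

# convert once to dense, so apply() works on ordinary numeric columns
toy_dense = as.matrix(toy_sim)

# for each test column: row index of the best reference and its similarity
best_row = apply(toy_dense, 2, which.max)
best_score = apply(toy_dense, 2, max)
```

Two caveats. A dense 10 million x 200 matrix of doubles needs about 16 GB, so this only works while the chunks stay small. The two approaches also treat all-zero columns differently: the dense version returns row 1 with a score of 0, while the triplet version has no row for that column at all, so "no match" cases need to be handled explicitly either way.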
Also, as a side suggestion, I recommend trying char_tokenizer() with n-grams (for example, of size c(3, 3)) to "fuzzy" match different spellings and abbreviations of addresses.
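A minimal sketch of that idea, reusing two reference addresses and the misspelled test address from the question. The helper make_char_it and the other variable names are made up for illustration, and the lower-casing step is an assumption rather than a requirement.

```r
library(text2vec)
library(Matrix)

ref_addr = c("949 eyre street sunnybank 4053 QLD",
             "12 black avenue kingston 2605 ACT")
test_addr = c("949 eyre roaod sunnybank 4053 QLD")

# hypothetical helper: iterator that splits each address into single characters
make_char_it = function(x) {
  itoken(x, preprocessor = tolower, tokenizer = char_tokenizer, progressbar = FALSE)
}

# vocabulary of character 3-grams built from both address sets
vocab = create_vocabulary(make_char_it(c(ref_addr, test_addr)), ngram = c(3L, 3L))
vectorizer = vocab_vectorizer(vocab)

dtm_ref = create_dtm(make_char_it(ref_addr), vectorizer)
dtm_test = create_dtm(make_char_it(test_addr), vectorizer)

# rows are reference addresses, columns are test addresses
sim = sim2(dtm_ref, dtm_test, method = "cosine", norm = "l2")

# index of the most similar reference address for the first test address
best_ref = which.max(as.matrix(sim)[, 1])
ref_addr[best_ref]
```

With character trigrams, a misspelling such as "roaod" only changes a few n-grams, so most of the address still overlaps with the correct reference. The trade-off is a larger vocabulary than with word tokens, which matters given the vocabulary size limits mentioned in the question.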