I'm using some machine learning from the SBERT Python module to calculate the top-K most similar strings given an input corpus and a target corpus (in this case 100K vs 100K in size).
The module is pretty robust and gets the comparison done quickly, returning a list of dictionaries containing the top-K most similar matches for each input string in the format:
{Corpus ID : Similarity_Score}
I can then wrap it up in a dataframe indexed by the query string list, giving me a dataframe in the format:
Query_String | Corpus_ID | Similarity_Score
The main time sink with my approach, however, is matching up the corpus ID with the string in the corpus so I know which string the input was matched against. My current solution uses pandas apply with the pandarallel module:
def retrieve_match_text(row, corpus_list):
    dict_obj = row['dictionary']
    corpus_id = dict_obj['corpus_id']  # corpus ID is an integer representing the index of a list
    score = dict_obj['score']
    matched_corpus_keyword = corpus_list[corpus_id]  # list index lookup (speed this up)
    return [matched_corpus_keyword, score]
.....
.....
# expand the dictionary into two columns and match the corpus KW to its ID
output_df[['Matched Corpus KW', 'Score']] = output_df.parallel_apply(
lambda x: pd.Series(retrieve_match_text(x, sentence_list_2)), axis=1)
This takes around 2 minutes for an input corpus of 100K against another corpus of 100K. However, I'm dealing with corpora of several million, so any further increase in speed here is welcome.
If I read the question correctly, you have the columns: Query_String and dictionary (is this correct?).
And then corpus_id and score are stored in that dictionary.
Your first target with pandas should be to work in a pandas-friendly way. Avoid the nested dictionary, store values directly in columns. After that, you can use efficient pandas operations.
Indexing a list is not what is slow for you. Done correctly, this becomes a whole-table merge/join and won't need any slow row-by-row apply and dictionary lookups.
Step 1. If you do this:
target_corpus = pd.Series(sentence_list_2, name="target_corpus")
Then you have an indexed series of one corpus (formerly the "list lookup").
Step 2. Get columns of score and corpus_id in your main dataframe.
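Step 2 can be done in one vectorised pass rather than a per-row apply. A minimal sketch, using made-up sample data that mirrors the question's layout (a dictionary column holding corpus_id and score per row):

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout.
output_df = pd.DataFrame({
    "Query_String": ["apple pie", "banana bread"],
    "dictionary": [
        {"corpus_id": 1, "score": 0.91},
        {"corpus_id": 0, "score": 0.87},
    ],
})

# Expand the nested dicts into flat columns in a single step.
expanded = pd.DataFrame(output_df["dictionary"].tolist())
output_df[["corpus_id", "score"]] = expanded[["corpus_id", "score"]]
```

After this, corpus_id and score are plain columns, so everything downstream can be whole-table operations.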
Step 3. Use pd.merge to join the input corpus on corpus_id vs the index of target_corpus, using how="left" (only items that match an existing corpus_id are relevant). This should be an efficient way to do it, and it's a whole-dataframe operation.
Develop and test the solution on a small subset (1K) to iterate quickly, then grow.