I have a machine learning problem where I am calculating the bigram Jaccard similarity of a pandas dataframe text column against the values of a dictionary. Currently I am storing the similarities as a list and then converting that list to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
These are the steps I am currently following. For each key in the dict:

1. Get bigrams for the pandas column and dict[key]
2. Calculate the Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice

def count_ngrams(lst, n):
    # Slide a window of length n over lst, yielding each n-gram as a tuple.
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
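For example:

# Quick check: word bigrams of the sample sentence.
list(count_ngrams("this is a sample text".split(), 2))
# [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]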
import numpy as np

def n_gram_jaccard_similarity(str1, str2, n):
    # Jaccard similarity of the two strings' word n-gram sets.
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except ZeroDivisionError:
        return np.nan
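For example, the first dict value below shares only one bigram with the sample text:

n_gram_jaccard_similarity("this is sample 1", "this is a sample text", 2)
# ('this', 'is') is the only shared bigram, 6 distinct bigrams in total
# -> 1/6 = 0.1666...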
def jc_list(sample_dict, row, n):
    # Similarity of this row's text against every dict value,
    # returned as the string representation of the list.
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
I use the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
import pandas as pd
import collections

df = pd.DataFrame(columns=["id", "text"], index=None)
df.loc[0] = ["1", "this is a sample text"]

sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:
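Each dict key's similarity as its own column, i.e. (as produced by the code above for the sample row) something like:

  id                   text        r1   r2   r3
0  1  this is a sample text  0.166667  0.0  0.2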
So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short time available I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise using an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. converting to lowercase, getting rid of whitespace, etc.), as sketched below.
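As a hypothetical example of such tweaking (these particular settings are not used below), a lowercased word-level variant could be configured like this:

# Assumed alternative, not used below: lowercased word-level bigrams.
word_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                         analyzer="word",
                                         lowercase=True)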
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the CountVectorizer and transform the data accordingly:

ngrammed = ngram_vectorizer.fit_transform(all_data) > 0

ngrammed is now a sparse boolean matrix indicating which tokens appear in the respective rows, rather than their counts as before. You can inspect ngram_vectorizer to find the mapping from tokens to column ids.
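The fitted vectorizer exposes that mapping through its vocabulary_ attribute (the exact indices depend on the fitted data):

# Dict mapping each character bigram to its column index in ngrammed.
ngram_vectorizer.vocabulary_
# e.g. {'th': ..., 'hi': ..., 'is': ..., ...}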
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)

jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    # Repeat the sample's indicator row once per text row.
    repeated_row_matrix = (csr_matrix(np.ones([text_rows, 1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(
        pd.Series((intersection.sum(axis=1) / support.sum(axis=1)).A1, name=key))
support is the boolean matrix that measures the union of the n-grams over both comparands. intersection is only True where a token is present in both. Note that .A1 returns the matrix object's underlying base array, flattened to one dimension.
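As a toy illustration of these two operations on made-up 0/1 indicator rows:

a = csr_matrix(np.array([[1, 1, 0]]))   # n-grams present in text a
b = csr_matrix(np.array([[1, 0, 1]]))   # n-grams present in text b
a.maximum(b).toarray()    # [[1, 1, 1]] -> union has 3 n-grams
a.multiply(b).toarray()   # [[1, 0, 0]] -> intersection has 1 n-gram
# Jaccard similarity = 1 / 3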
Now pd.concat(jaccard_similarities, axis=1) gives:
         r1        r2        r3
0  0.631579  0.444444  0.500000
1  0.480000  0.333333  0.384615
You can also concat it to df and obtain the combined frame with

pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
  id                          text        r1        r2        r3
0  1         this is a sample text  0.631579  0.444444  0.500000
1  2  this is a second sample text  0.480000  0.333333  0.384615
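For convenience, here is a sketch that wraps all of the above into a single helper, using the imports from above (the function name and signature are my own, and it assumes the dataframe has a default RangeIndex):

def add_jaccard_features(df, sample_dict, text_col="text", ngram_range=(2, 2)):
    # Fit char n-grams over the texts and dict values together so that
    # both share one column indexing scheme.
    vectorizer = CountVectorizer(ngram_range=ngram_range, analyzer="char")
    all_data = np.concatenate((df[text_col].to_numpy(),
                               np.array(list(sample_dict.values()))))
    ngrammed = vectorizer.fit_transform(all_data) > 0
    texts, samples = ngrammed[:len(df)], ngrammed[len(df):]
    sims = []
    for key, sample_row in zip(sample_dict.keys(), samples):
        # Repeat the sample's indicator row once per text row.
        repeated = (csr_matrix(np.ones([len(df), 1])) * sample_row).astype(bool)
        union = texts.maximum(repeated)
        inter = texts.multiply(repeated).todense()
        sims.append(pd.Series((inter.sum(axis=1) / union.sum(axis=1)).A1, name=key))
    return pd.concat([df, pd.concat(sims, axis=1)], axis=1)

Calling add_jaccard_features(df, sample_dict) then reproduces the table above.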