I have a machine learning problem where I am calculating the bigram Jaccard similarity of a pandas dataframe text column against the values of a dictionary. Currently I am storing the similarities as a list and then converting that list to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
These are the steps I am currently following. For each key in the dict:

1. Get bigrams for the pandas column and dict[key]
2. Calculate the Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice

def count_ngrams(lst, n):
    # Slide a window of length n over lst, yielding each n-gram as a tuple.
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
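For example:

# Quick check: word bigrams of the sample sentence.
list(count_ngrams("this is a sample text".split(), 2))
# [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]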
import numpy as np

def n_gram_jaccard_similarity(str1, str2, n):
    # Jaccard similarity of the two strings' word n-gram sets.
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except ZeroDivisionError:
        return np.nan
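For example, the first dict value below shares only one bigram with the sample text:

n_gram_jaccard_similarity("this is sample 1", "this is a sample text", 2)
# ('this', 'is') is the only shared bigram, 6 distinct bigrams in total
# -> 1/6 = 0.1666...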
def jc_list(sample_dict, row, n):
    # Similarity of this row's text against every dict value,
    # returned as the string representation of the list.
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
I use the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
import pandas as pd
import collections

df = pd.DataFrame(columns=["id", "text"], index=None)
df.loc[0] = ["1", "this is a sample text"]

sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:
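Each dict key's similarity as its own column, i.e. (as produced by the code above for the sample row) something like:

  id                   text        r1   r2   r3
0  1  this is a sample text  0.166667  0.0  0.2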
So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short time available I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise using an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. converting to lowercase, getting rid of whitespace, etc.), as sketched below.
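As a hypothetical example of such tweaking (these particular settings are not used below), a lowercased word-level variant could be configured like this:

# Assumed alternative, not used below: lowercased word-level bigrams.
word_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                         analyzer="word",
                                         lowercase=True)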
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the CountVectorizer and transform the data accordingly:

ngrammed = ngram_vectorizer.fit_transform(all_data) > 0

ngrammed is now a sparse boolean matrix indicating which tokens appear in the respective rows, rather than their counts as before. You can inspect ngram_vectorizer to find the mapping from tokens to column ids.
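The fitted vectorizer exposes that mapping through its vocabulary_ attribute (the exact indices depend on the fitted data):

# Dict mapping each character bigram to its column index in ngrammed.
ngram_vectorizer.vocabulary_
# e.g. {'th': ..., 'hi': ..., 'is': ..., ...}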
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)

jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    # Repeat the sample's indicator row once per text row.
    repeated_row_matrix = (csr_matrix(np.ones([text_rows, 1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(
        pd.Series((intersection.sum(axis=1) / support.sum(axis=1)).A1, name=key))
support is the boolean matrix that measures the union of the n-grams over both comparands. intersection is only True where a token is present in both. Note that .A1 returns the matrix object's underlying base array, flattened to one dimension.
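As a toy illustration of these two operations on made-up 0/1 indicator rows:

a = csr_matrix(np.array([[1, 1, 0]]))   # n-grams present in text a
b = csr_matrix(np.array([[1, 0, 1]]))   # n-grams present in text b
a.maximum(b).toarray()    # [[1, 1, 1]] -> union has 3 n-grams
a.multiply(b).toarray()   # [[1, 0, 0]] -> intersection has 1 n-gram
# Jaccard similarity = 1 / 3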
Now pd.concat(jaccard_similarities, axis=1) gives:
         r1        r2        r3
0  0.631579  0.444444  0.500000
1  0.480000  0.333333  0.384615
You can also concat it to df and obtain the combined frame with

pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
  id                          text        r1        r2        r3
0  1         this is a sample text  0.631579  0.444444  0.500000
1  2  this is a second sample text  0.480000  0.333333  0.384615
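For convenience, here is a sketch that wraps all of the above into a single helper, using the imports from above (the function name and signature are my own, and it assumes the dataframe has a default RangeIndex):

def add_jaccard_features(df, sample_dict, text_col="text", ngram_range=(2, 2)):
    # Fit char n-grams over the texts and dict values together so that
    # both share one column indexing scheme.
    vectorizer = CountVectorizer(ngram_range=ngram_range, analyzer="char")
    all_data = np.concatenate((df[text_col].to_numpy(),
                               np.array(list(sample_dict.values()))))
    ngrammed = vectorizer.fit_transform(all_data) > 0
    texts, samples = ngrammed[:len(df)], ngrammed[len(df):]
    sims = []
    for key, sample_row in zip(sample_dict.keys(), samples):
        # Repeat the sample's indicator row once per text row.
        repeated = (csr_matrix(np.ones([len(df), 1])) * sample_row).astype(bool)
        union = texts.maximum(repeated)
        inter = texts.multiply(repeated).todense()
        sims.append(pd.Series((inter.sum(axis=1) / union.sum(axis=1)).A1, name=key))
    return pd.concat([df, pd.concat(sims, axis=1)], axis=1)

Calling add_jaccard_features(df, sample_dict) then reproduces the table above.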