I want to N-gram a set of strings in MXNet. Preferably, I would do something like TF-IDF vectorization, but even a simple N-gram with count and feature limits would be fine. Is there a built-in function for this? What would be the best approach?
Currently, I am computing it in Python:
def tfidf(str_list, ngram_width=3):
    # Count character n-grams over the whole corpus (term frequency).
    tf = {}
    for s in str_list:
        for start, end in zip(range(len(s) - ngram_width + 1),
                              range(ngram_width, len(s) + 1)):
            if s[start:end] not in tf:
                tf[s[start:end]] = 0
            tf[s[start:end]] += 1
    # Inverse document frequency: how many strings contain each n-gram.
    idf = {}
    for t in tf.keys():
        cnt = 0
        for s in str_list:
            if t in s:
                cnt += 1
        idf[t] = len(str_list) / (cnt + 1.0)
    return {t: tf[t] * idf[t] for t in tf.keys()}
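For example, called on a couple of throwaway strings (purely illustrative), it returns a dict mapping each character trigram to its TF-IDF weight:

scores = tfidf(["hello world", "hello there"], ngram_width=3)
# e.g. scores["hel"] is the corpus count of "hel", down-weighted by
# how many of the strings contain it
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])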
Let's step back and ask why we would traditionally represent text by n-grams. N-grams attempt to capture interesting collocations, i.e. words that appear together: "White House" as a bigram is potentially more informative than just knowing the sentence contains the words "White" and "House".
The downside of using n-grams is increased sparsity -- many collocations have low frequency, and we may encounter collocations at prediction time that were never seen during training.
For Deep Learning, we can capture collocations, and more generally the information encoded in the order of words, using RNNs such as LSTMs.
A typical way to handle textual input for Deep Learning would therefore be a Word2Vec encoding of the text with, say, an LSTM on top of it (or a BiLSTM to be fancier).
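As a rough sketch of that pipeline in MXNet Gluon (the layer sizes and the dummy batch below are made-up placeholders, and the Embedding layer stands in for, or can be initialized from, pre-trained Word2Vec vectors):

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn, rnn

# Hypothetical sizes -- adjust to your vocabulary and task.
vocab_size, embed_dim, hidden, num_classes = 10000, 100, 128, 2

net = nn.Sequential()
net.add(nn.Embedding(vocab_size, embed_dim))                   # token ids -> dense vectors
net.add(rnn.LSTM(hidden, layout='NTC', bidirectional=True))    # BiLSTM over the sequence
net.add(nn.Dense(num_classes))                                 # flattens and classifies
net.initialize()

# Dummy batch: 4 sequences of 20 placeholder token ids each.
tokens = nd.array([[3, 15, 2, 0, 7] * 4] * 4)
out = net(tokens)   # shape (4, num_classes)

The fixed-length n-gram window then becomes unnecessary, since the recurrent layer sees the whole sequence and can pick up longer-range dependencies on its own.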