I'm dabbling with ML and was able to take a tutorial and adapt it to my needs. It's a simple recommender system built with TfidfVectorizer and linear_kernel. Where I'm stuck is how to deploy it through SageMaker as an endpoint.
import json

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Load the raw documents into a DataFrame.
with open('data/big_data.json') as json_file:
    data = json.load(json_file)
ds = pd.DataFrame(data)

# Fit the vectorizer and build the TF-IDF matrix for every document.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['content'])

# Pairwise similarity between all documents (TF-IDF rows are
# L2-normalized, so linear_kernel gives cosine similarity here).
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Precompute the top matches for every item, keyed by item ID.
results = {}
for idx, row in ds.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    results[row['id']] = similar_items[1:]  # skip the item itself

def item(id):
    return ds.loc[ds['id'] == id]['id'].tolist()[0]

def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

recommend(item_id='129035', num=5)
As a starting point, I'm not sure whether the output of tf.fit_transform(ds['content']) is considered the model, or the output of linear_kernel(tfidf_matrix, tfidf_matrix).
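From what I can tell, the fitted TfidfVectorizer (tf) is the model in the scikit-learn sense: it holds the learned vocabulary and IDF weights and can transform unseen text. The linear_kernel output is just a precomputed score matrix derived from it. A rough sketch of how the two pieces could be saved and reused at inference time, assuming joblib is available (the file names here are placeholders):

import joblib
from sklearn.metrics.pairwise import linear_kernel

# Persist the fitted vectorizer (the "model") and the document matrix
# it produced; both file names are arbitrary.
joblib.dump(tf, 'tfidf_vectorizer.joblib')
joblib.dump(tfidf_matrix, 'tfidf_matrix.joblib')

# At inference time, reload both and score a new piece of text against
# the existing documents -- no refit required.
tf = joblib.load('tfidf_vectorizer.joblib')
tfidf_matrix = joblib.load('tfidf_matrix.joblib')

query_vec = tf.transform(['some new product description'])
scores = linear_kernel(query_vec, tfidf_matrix).ravel()
top_indices = scores.argsort()[::-1][:5]  # five closest existing items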
I came to the conclusion that I didn't need to deploy this through SageMaker. Since the final results structure is just a dictionary keyed by item ID, I can do quick lookups to find the correlated items.
I have it working on AWS with API Gateway/Lambda, DynamoDB, and an EC2 server that collects and processes the data and loads it into DynamoDB for fast lookups. No expensive SageMaker endpoint needed.
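To make the lookup pattern concrete, here is a rough sketch of the two halves, assuming a DynamoDB table named recommendations with a string partition key id, and an API Gateway route like /items/{id} (all names are hypothetical, not the exact setup described above):

import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('recommendations')  # hypothetical table name

def load_results(results):
    """Batch job (EC2 side): write the precomputed dict to DynamoDB.
    The resource API rejects floats, so scores are stored as Decimals."""
    with table.batch_writer() as batch:
        for item_id, similar_items in results.items():
            batch.put_item(Item={
                'id': item_id,
                'recs': [
                    {'id': rec_id, 'score': Decimal(str(score))}
                    for score, rec_id in similar_items
                ],
            })

def lambda_handler(event, context):
    """API Gateway/Lambda side: look up one item's recommendations."""
    item_id = event['pathParameters']['id']  # assumes an /items/{id} route
    response = table.get_item(Key={'id': item_id})
    if 'Item' not in response:
        return {'statusCode': 404, 'body': json.dumps({'error': 'unknown id'})}
    return {'statusCode': 200, 'body': json.dumps(response['Item'], default=str)}

With the heavy TF-IDF work done offline on the EC2 box, the Lambda is a single GetItem call, which keeps latency and cost low.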