I have a dataset of restaurants in the 'pizza' category, derived from the Yelp dataset. Each restaurant has three columns: latitude, longitude, and checkins. I need to build a model that predicts the coordinates (latitude, longitude) at which I should open a new restaurant so that its number of check-ins is likely to be high. There are 4951 rows in total.
    checkins   latitude   longitude
0          2  33.394877 -111.600194
1          2  43.841217  -79.303936
2          1  40.442828  -80.186293
3          1  41.141631  -81.356603
4          1  40.434399  -79.922983
5          1  33.552870 -112.133712
6          1  43.686836  -79.293838
7          2  41.131282  -81.490180
8          1  40.500796  -79.943429
9         12  36.010086 -115.118656
10         2  41.484475  -81.921150
11         1  43.842450  -79.027990
12         1  43.724840  -79.289919
13         2  45.448630  -73.608719
14         1  45.577027  -73.330855
15         1  36.238059 -115.210341
16         1  33.623055 -112.339758
17         1  43.762768  -79.491417
18         1  43.708415  -79.475884
19         1  45.588257  -73.428926
20         4  41.152875  -81.358754
21         1  41.608833  -81.525020
22         1  41.425152  -81.896178
23         1  43.694716  -79.304879
24         1  40.442147  -79.956513
25         1  41.336466  -81.784790
26         1  33.231942 -111.721218
27         2  36.291436 -115.287016
28         2  33.641847 -111.995571
29         1  43.570217  -79.566431
...      ...        ...         ...
I tried approaching the problem with clustering, using DBSCAN, and ended up with the following graph, but I cannot make any sense of it. How do I proceed further, or how should I approach the problem differently to get my results?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Join the pizza restaurants with their check-in counts.
review = pd.read_csv('pizza_category.csv')
checkin = pd.read_csv('yelp_academic_dataset/yelp_checkin.csv')
final = pd.merge(review, checkin, on='business_id', how='inner')
# dropna() returns a new frame, so the result has to be assigned back.
final = final.dropna().reset_index(drop=True)

# Feature matrix with columns in the order: checkins, latitude, longitude.
X = final[['checkins', 'latitude', 'longitude']].astype(np.float64)
print(X)
arr = X.values

# eps is applied to raw check-in counts and degrees alike (no scaling).
db = DBSCAN(eps=2, min_samples=5)
y_pred = db.fit_predict(arr)

plt.figure(figsize=(20, 10))
# Column 0 is checkins and column 1 is latitude.
plt.scatter(arr[:, 0], arr[:, 1], c=y_pred, cmap="plasma")
plt.xlabel("checkins")
plt.ylabel("latitude")
plt.show()
This is not a clustering problem.
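Clustering only groups the existing restaurants; it does not tell you where check-ins concentrate. What you want to do is density estimation: estimate a density over latitude and longitude, weighting each restaurant by its previous check-in frequency, and then look for locations where that density is high.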
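One possible way to sketch this, offered as an illustration rather than a definitive method, is a check-in-weighted kernel density estimate using scikit-learn's KernelDensity with the haversine metric. The helper names fit_checkin_density and best_location, the 2 km bandwidth, and the candidate points are all assumptions made for this example; none of them come from the question, and the bandwidth in particular would need tuning (for instance by cross-validation).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

EARTH_RADIUS_KM = 6371.0


def fit_checkin_density(lat_deg, lon_deg, checkins, bandwidth_km=2.0):
    """Fit a Gaussian KDE over locations, weighted by check-in counts.

    Hypothetical helper; bandwidth_km=2.0 is an arbitrary starting value.
    """
    # The haversine metric expects [latitude, longitude] in radians.
    coords = np.radians(np.column_stack([lat_deg, lon_deg]))
    kde = KernelDensity(
        bandwidth=bandwidth_km / EARTH_RADIUS_KM,  # kilometres -> radians
        kernel="gaussian",
        metric="haversine",
        algorithm="ball_tree",
    )
    # Each restaurant contributes in proportion to its check-in count.
    kde.fit(coords, sample_weight=np.asarray(checkins, dtype=float))
    return kde


def best_location(kde, candidates_deg):
    """Return the candidate (lat, lon) with the highest estimated density."""
    candidates_deg = np.asarray(candidates_deg, dtype=float)
    log_density = kde.score_samples(np.radians(candidates_deg))
    return tuple(candidates_deg[np.argmax(log_density)])


# Toy data (made up for illustration): a busy area and a quiet one.
lat = np.array([36.00, 36.01, 36.02, 36.20, 36.21])
lon = np.array([-115.10, -115.11, -115.12, -115.30, -115.31])
checkins = np.array([12, 10, 11, 1, 1])

kde = fit_checkin_density(lat, lon, checkins)
candidates = np.array([[36.01, -115.11], [36.205, -115.305], [36.10, -115.20]])
best = best_location(kde, candidates)
print(best)
```

With the real data, lat, lon and checkins would come from the merged frame, and the candidates could be a grid of points. The sample rows appear to span several widely separated metro areas, so it probably makes more sense to build a grid and fit the density per city than over the whole bounding box. Note also that a high check-in density only says where demand has been observed; it does not account for competition from the restaurants already there.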