My database consists of a collection of a large number of hotels (approx. 121,000).
This is what a document in my collection looks like:
{
"_id" : ObjectId("57bd5108f4733211b61217fa"),
"autoid" : 1,
"parentid" : "P01982.01982.110601173548.N2C5",
"companyname" : "Sheldan Holiday Home",
"latitude" : 34.169552,
"longitude" : 77.579315,
"state" : "JAMMU AND KASHMIR",
"city" : "LEH Ladakh",
"pincode" : 194101,
"phone_search" : "9419179870|253013",
"address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
"email" : "",
"website" : "",
"national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
"area" : "Leh Ladakh",
"data_city" : "Leh Ladakh"
}
Each document can have one or more phone numbers separated by a "|" delimiter.
I have to group together documents that share a phone number.
By real time, I mean that when a user opens a particular hotel to see its details on the web interface, I should be able to display all the hotels linked to it, grouped by common phone numbers.
While grouping, if one hotel links to another and that hotel links to a third, then all 3 should be grouped together.
Example : Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C has phone numbers 2|3, then A, B and C should be grouped together.
from pymongo import MongoClient
from pprint import pprint #Pretty print
import re #for regex
#import unicodedata
client = MongoClient()
cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []
#We'll be passing the value later as parameter via a function call
#hId = 37443;
regx = re.compile("^Vivanta", re.IGNORECASE)
#Connection
db = client.hotel
collection = db.hotelData
#Finding hotels wrt search input
for post in collection.find({"companyname":regx}):
    list.append(post)
#Copying all hotels in a list
for post1 in collection.find():
    allHotels.append(post1)
hotelIndex = 11 #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"] #Name of selected hotel
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel
try:
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y
print "Contact details of",x,":"
#Printing all contacts...
for contact in splitList:
    print contact
    conContact.extend(contact)
    cLen = cLen+1
print "No. of contacts in",x,"=",cLen
for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA+1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
# print splitListAll
#count = 0
#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll:
    if contactAll in conContact:
        conContact.extend(contactAll)
        # contactChk = contactAll
        # if (set(conContact) & set(contactChk)):
        #     conContact = contactChk
        #     contactChk[:] = [] #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB+1
print "Printing the list of connected hotels..."
for final in collection.find({"autoid":conId}):
    print final
This is the code I wrote in Python. In it, I tried performing a linear search in a for loop. I am getting some errors at the moment, but it should work once they are fixed.
I need an optimized version of this, as linear search has poor time complexity.
I am pretty new to this, so any other suggestions to improve the code are welcome.
Thanks.
The easiest answer to any Python in-memory search-for question is "use a dict". Dicts are hash tables, so looking up a key takes O(1) time on average, whereas searching a list takes O(N).
Also remember that you can put a Python object into as many dicts (or lists), and as many times into one dict or list, as you need. The object is not copied; each entry is just a reference to it.
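To make the point about references concrete, here is a small sketch. The record and the names by_phone and by_id are made up for illustration and are not part of the original data.

```python
# Hypothetical sample record, not taken from the real collection
hotel = {"autoid": 1, "phone_search": "111|222"}

by_phone = {}
by_id = {}

# The same object is stored three times; nothing is copied
by_phone["111"] = [hotel]
by_phone["222"] = [hotel]
by_id[hotel["autoid"]] = hotel

# A change made through one name is visible through every entry
hotel["city"] = "Leh"
same_object = by_phone["111"][0] is by_id[1] is by_phone["222"][0]
print(same_object)
print(by_phone["222"][0]["city"])
```

So indexing 121,000 hotels by several phone numbers each costs memory for the extra dict entries only, not for extra copies of the documents.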
So the essentials will look like this:
hotelsbyphone = {}  # maps each phone number to the list of hotels that use it
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)
At the end of this loop, hotelsbyphone["123456"] will be a list of hotel objects which had "123456" as one of their phone_search strings. The key coding feature is the .setdefault(key, []) method, which initializes an empty list if the key is not already in the dict, so that you can then append to it.
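As a runnable sketch of the indexing step, here it is applied to the A, B, C example from the question. The function name build_phone_index and the sample records are assumptions for illustration; the check for empty strings is an addition to cope with documents whose phone_search is blank.

```python
# Sample records modelled on the A/B/C example in the question
hotels = [
    {"companyname": "A", "phone_search": "1|2"},
    {"companyname": "B", "phone_search": "3|4"},
    {"companyname": "C", "phone_search": "2|3"},
]

def build_phone_index(hotels):
    """Map each phone number to the list of hotels that list it."""
    index = {}
    for hotel in hotels:
        for phone in hotel["phone_search"].split("|"):
            if phone:  # skip empty strings from a blank field or a stray "|"
                index.setdefault(phone, []).append(hotel)
    return index

hotelsbyphone = build_phone_index(hotels)
names_for_2 = [h["companyname"] for h in hotelsbyphone["2"]]
print(names_for_2)
```

Building the index is a single pass over all hotels, so it only needs to be done once and can then be reused for every lookup.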
Once you have built this index, a lookup like this will be fast:
try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    pass
As an alternative to try ... except, test with if x in hotelsbyphone:
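The question also asks for chains to be grouped (A shares a number with C, C shares one with B, so all three belong together). One possible way to get that from the phone index is a breadth-first walk: start from the selected hotel, look up each of its numbers, and repeat for every newly found hotel until nothing new turns up. The following is only a sketch under assumptions: the helper names split_phones and linked_hotels are invented here, the sample data is made up, and autoid is assumed to be unique per hotel.

```python
from collections import deque

def split_phones(hotel):
    """Return the non-empty phone numbers of a hotel document."""
    return [p for p in hotel["phone_search"].split("|") if p]

def linked_hotels(start, hotelsbyphone):
    """Return every hotel reachable from start through shared phone numbers."""
    seen_ids = {start["autoid"]}   # assumes autoid is unique per hotel
    seen_phones = set()
    group = [start]
    queue = deque([start])
    while queue:
        hotel = queue.popleft()
        for phone in split_phones(hotel):
            if phone in seen_phones:
                continue  # this number has already been expanded
            seen_phones.add(phone)
            for other in hotelsbyphone.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group

# Made-up sample data: A, B and C are chained, D is unrelated
hotels = [
    {"autoid": 1, "companyname": "A", "phone_search": "1|2"},
    {"autoid": 2, "companyname": "B", "phone_search": "3|4"},
    {"autoid": 3, "companyname": "C", "phone_search": "2|3"},
    {"autoid": 4, "companyname": "D", "phone_search": "9"},
]

hotelsbyphone = {}
for hotel in hotels:
    for phone in split_phones(hotel):
        hotelsbyphone.setdefault(phone, []).append(hotel)

group_names = sorted(h["companyname"] for h in linked_hotels(hotels[0], hotelsbyphone))
print(group_names)
```

Each phone number and each hotel is expanded at most once, so the walk only touches the hotels in the selected group rather than scanning the whole collection.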