python mongodb real-time aggregation-framework mongodb-aggregation

Grouping of documents having the same phone number

My database consists of a collection of a large number of hotels (approximately 121,000).

This is what a document in my collection looks like:

{
    "_id" : ObjectId("57bd5108f4733211b61217fa"),
    "autoid" : 1,
    "parentid" : "P01982.01982.110601173548.N2C5",
    "companyname" : "Sheldan Holiday Home",
    "latitude" : 34.169552,
    "longitude" : 77.579315,
    "state" : "JAMMU AND KASHMIR",
    "city" : "LEH Ladakh",
    "pincode" : 194101,
    "phone_search" : "9419179870|253013",
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
    "email" : "",
    "website" : "",
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
    "area" : "Leh Ladakh",
    "data_city" : "Leh Ladakh"
}

Each document can have one or more phone numbers, separated by the "|" delimiter.

I have to group together documents that share a phone number.

By real time, I mean that when a user opens a particular hotel to see its details on the web interface, I should be able to display all the hotels linked to it, grouped by common phone numbers.

While grouping, if one hotel links to another and that hotel links to a third, then all three should be grouped together.

Example: Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C has phone numbers 2|3. A, B and C should then be grouped together.

from pymongo import MongoClient
from pprint import pprint #Pretty print 
import re #for regex
#import unicodedata

client = MongoClient()

cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []

#We'll be passing the value later as parameter via a function call 
#hId = 37443; 

regx = re.compile("^Vivanta", re.IGNORECASE)

#Connection
db = client.hotel
collection = db.hotelData

#Finding hotels wrt search input
for post in collection.find({"companyname":regx}):
    list.append(post)

#Copying all hotels in a list
for post1 in collection.find():
    allHotels.append(post1)

hotelIndex = 11 #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"] #Name of selected hotel
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel

try:
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y


print "Contact details of",x,":"

#Printing all contacts...
for contact in splitList:   
    print contact 
    conContact.extend(contact)
    cLen = cLen+1

print "No. of contacts in",x,"=",cLen


for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA+1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
#   print splitListAll

#count = 0 

#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll: 
    if contactAll in conContact:
        conContact.extend(contactAll)
#       contactChk = contactAll
#       if (set(conContact) & set(contactChk)):
#           conContact = contactChk
#           contactChk[:] = [] #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB+1

print "Printing the list of connected hotels..."
for final in collection.find({"autoid":conId}):
    print final

This is the code I wrote in Python. In it, I tried performing a linear search in a for loop. I am getting some errors as of now, but it should work once they are rectified.

I need an optimized version of this, as linear search has poor time complexity.

I am pretty new to this so any other suggestions to improve the code are welcome.

Thanks.

Solution

The easiest answer to any Python in-memory search-for question is "use a dict". Dicts are hash tables, so key access is O(1) on average; searching a list is O(N).

Also remember that you can put the same Python object into as many dicts (or lists) as you need, and as many times into one dict or list as you need. The object is not copied; each entry is just a reference to it.
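To see that these are references rather than copies, here is a minimal illustration. The hotel dict, the email value and the two index names (by_phone, by_city) are made up for this example.

```python
# Hypothetical data, for illustration only.
hotel = {"companyname": "Sheldan Holiday Home",
         "city": "LEH Ladakh",
         "phone_search": "9419179870|253013"}

by_phone = {}
by_city = {}
by_phone["253013"] = [hotel]       # stores a reference, not a copy
by_city["LEH Ladakh"] = [hotel]    # the very same object again

# A change made through one index is visible through the other,
# because both lists point at one object.
by_phone["253013"][0]["email"] = "info@example.com"
same_object = by_phone["253013"][0] is by_city["LEH Ladakh"][0]
```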

So the essentials will look like this:

hotelsbyphone = {}  # phone number (string) -> list of hotel documents
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)

At the end of this loop, hotelsbyphone["123456"] will be a list of hotel objects which had "123456" as one of their phone_search strings. The key coding feature is the .setdefault(key, []) method which initializes an empty list if the key is not already in the dict, so that you can then append to it.
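As a small worked example of that loop, here is the index built for the A/B/C case from the question. The three toy documents are invented for illustration and carry only the fields the loop needs.

```python
# Toy data mirroring the A/B/C example from the question (illustrative only).
hotels = [
    {"companyname": "A", "phone_search": "1|2"},
    {"companyname": "B", "phone_search": "3|4"},
    {"companyname": "C", "phone_search": "2|3"},
]

hotelsbyphone = {}
for hotel in hotels:
    for phone in hotel["phone_search"].split("|"):
        # setdefault creates the empty list the first time a number is seen
        hotelsbyphone.setdefault(phone, []).append(hotel)

names_for_2 = [h["companyname"] for h in hotelsbyphone["2"]]  # ['A', 'C']
names_for_4 = [h["companyname"] for h in hotelsbyphone["4"]]  # ['B']
```

Note that a document with an empty phone_search (like the empty email field in the sample) would land under the key "", so you may want to skip empty strings when indexing real data.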

Once you have built this index, a lookup like this will be fast:

try:
    matches = hotelsbyphone[x]  # x is a phone number string
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    pass

As an alternative to try ... except, test with if x in hotelsbyphone: before indexing.
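The question also asks for chains (A shares a number with C, C shares one with B, so all three belong together). One way to get that from the same index is a breadth-first walk over shared numbers. This is only a sketch under some assumptions: linked_hotels is a hypothetical helper name, autoid is assumed to be unique per hotel, and the toy documents are made up.

```python
from collections import deque

def linked_hotels(start, hotelsbyphone):
    """Return every hotel reachable from `start` through shared phone numbers."""
    seen_ids = {start["autoid"]}   # assumes autoid is unique per hotel
    group = [start]
    queue = deque([start])
    while queue:
        hotel = queue.popleft()
        for phone in hotel["phone_search"].split("|"):
            if not phone:          # skip empty numbers so they link nothing
                continue
            # .get avoids a KeyError for numbers missing from the index
            for other in hotelsbyphone.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group

# Toy data: A, B and C are chained through numbers 2 and 3; D is unrelated.
hotels = [
    {"autoid": 1, "companyname": "A", "phone_search": "1|2"},
    {"autoid": 2, "companyname": "B", "phone_search": "3|4"},
    {"autoid": 3, "companyname": "C", "phone_search": "2|3"},
    {"autoid": 4, "companyname": "D", "phone_search": "9"},
]
hotelsbyphone = {}
for hotel in hotels:
    for phone in hotel["phone_search"].split("|"):
        if phone:
            hotelsbyphone.setdefault(phone, []).append(hotel)

group_names = sorted(h["companyname"] for h in linked_hotels(hotels[0], hotelsbyphone))
```

Each hotel is queued at most once, so the walk only touches the hotels in the selected group and their phone numbers rather than scanning all 121,000 documents.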