I have an MSSQL stored procedure which returns 4 selections to me: Entities, Certificates, Contacts and Logs. I need to combine these 4 selections in Python, where I put all Entities, Contacts and Logs under their Certificate. Each of these selections has an EntityId I can use for the merge.
The inputs are lists of simple, basic dataclasses containing the information from SQL. We convert these dataclasses into dictionaries inside the merging function.
When I originally wrote the code, I had no idea that the selections could be very large (hundreds of thousands of Certificates with all their related records). Unfortunately, this made the code below very inefficient due to the many unnecessary iterations of the list comprehensions inside the loop. It can take up to 70 seconds. I am sure that there is a way to make this much faster. How do I make this as efficient as possible?
from dataclasses import asdict
from typing import List

def cert_and_details(entities: List[Entity],
                     certificates: List[Certificate],
                     req_logs: List[DocumentRequestHistory],
                     recipients: List[Recipient]) -> List[dict]:
    entities = [asdict(ent) for ent in entities]
    certificates = [asdict(cert) for cert in certificates]
    req_logs = [asdict(log) for log in req_logs]
    recipients = [asdict(rec) for rec in recipients]

    results = []
    for cert_dict in certificates:
        cert_entity_id = cert_dict["entityid"]
        logs_under_cert = [log for log in req_logs if log["entityid"] == cert_entity_id]
        cert_dict["logs"] = logs_under_cert
        entities_under_cert = [ent for ent in entities if ent["entityid"] == cert_entity_id]
        cert_dict["linkedentity"] = entities_under_cert
        recipients_under_cert = [rec for rec in recipients if rec["entityid"] == cert_entity_id]
        cert_dict["recipients"] = recipients_under_cert
        results.append(cert_dict)
    return results
The main issue with the provided code is its computational complexity: it runs in O(C * (L + E + R)), where C is the number of certificates, L the number of logs, E the number of entities and R the number of recipients. This is fine if L + E + R is small, but if this is not the case, then the code will be slow.

You can write an implementation running in O(C + L + E + R) time. The idea is to first build an index that groups logs/entities/recipients by entity ID. Here is a short example:
# Note: defaultdict should help to make this code smaller (and possibly faster)
logIndex = dict()
for log in req_logs:
    entityId = log["entityid"]
    if entityId in logIndex:
        logIndex[entityId].append(log)
    else:
        logIndex[entityId] = [log]
This code runs in (amortized) O(L) time. You can then retrieve all the items in req_logs with a given entity ID using just logIndex[entityId].
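Putting it together, here is a sketch of how the whole merge could look with one index per selection, using collections.defaultdict as hinted in the comment above (function and variable names here are illustrative, and the inputs are assumed to already be lists of dicts, e.g. produced with dataclasses.asdict):

```python
from collections import defaultdict

def index_by_entity(records):
    """Group a list of record dicts by their 'entityid' key in O(n)."""
    index = defaultdict(list)
    for rec in records:
        index[rec["entityid"]].append(rec)
    return index

def merge_certificates(certificates, req_logs, entities, recipients):
    # Build each index once, in linear time.
    log_index = index_by_entity(req_logs)
    entity_index = index_by_entity(entities)
    recipient_index = index_by_entity(recipients)

    # One O(1) lookup per certificate instead of three full list scans.
    for cert in certificates:
        eid = cert["entityid"]
        cert["logs"] = log_index.get(eid, [])
        cert["linkedentity"] = entity_index.get(eid, [])
        cert["recipients"] = recipient_index.get(eid, [])
    return certificates
```

The .get(eid, []) calls ensure that certificates with no matching records still get empty lists, matching the behavior of the original list comprehensions.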
There is another issue in the provided code: lists of dictionaries are inefficient. Dictionary indexing is slow, and dictionaries are not memory efficient either. A better way to store and process the data could be to use dataframes (e.g. with Pandas, which also provides a relatively optimized groupby function).