For a given list of unique items, when I sort them on a few columns in batches, I get duplicated items in the combined result.
There are 2 tables:
- item_popularity_tbl:
  iid (UINT, PK), pplt (UTINYINT)
- item_cat_id_tbl:
  iid (UINT, PK), cid (UTINYINT)
Where:
iid: item ID; unique within each table.
cid: category ID, values in the range [1, 15]. Multiple items can share the same cid.
pplt: popularity, values in the range [1, 10]. Multiple items can share the same pplt.
Given a list of item IDs, all unique, I need to order them by cid ASC and then by pplt DESC.
I use the Python/MySQL code below to achieve it:
# Python code snippet
def db_get_items_sorted(conn, iid_list, offset, limit):
    # One '%s' placeholder per item ID, for the IN (...) clause
    n1 = ','.join(['%s'] * len(iid_list))

    sql_stmt = ("SELECT a.iid, a.cid, b.pplt "
                "FROM pclg.item_cat_id_tbl AS a "
                "INNER JOIN pclg.item_popularity_tbl AS b ON b.iid=a.iid "
                "WHERE a.iid IN (%s) "
                "ORDER BY a.cid ASC, "
                "b.pplt DESC "
                "LIMIT %s,%s") % (n1, offset, limit)

    # Second interpolation fills the IN (...) placeholders with the item IDs
    return conn.query(sql_stmt % tuple(iid_list))
I call db_get_items_sorted(...) in batches of 10. I tested this with an input iid_list of 58 unique items, against fully populated tables with unique iids. Each call returns 10 unique iids, properly sorted as requested. But when I concatenate the batches to build the final 58-item list, several items (iid) are duplicated (and for those, all returned columns are identical, i.e., complete row duplication). For example: 48 unique rows and 10 duplicates.
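The failure mode can be reproduced without a database. In the sketch below (toy data, not the real tables), every row has the same (cid, pplt) key, and each "query execution" happens to return the tied rows in a different internal order, which is exactly what MySQL is allowed to do:

```python
# Toy rows: (iid, cid, pplt). All 20 rows tie on (cid, pplt),
# so ORDER BY cid ASC, pplt DESC leaves their relative order unspecified.
rows = [(iid, 1, 5) for iid in range(20)]

def fetch(tie_order, offset, limit):
    # tie_order models the server's arbitrary internal ordering of tied
    # rows: +1 ~ iid-ascending, -1 ~ iid-descending within ties.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2], tie_order * r[0]))
    return ordered[offset:offset + limit]

batch1 = fetch(+1, 0, 10)    # first call: ties came back one way
batch2 = fetch(-1, 10, 10)   # second call: ties came back the other way
iids = [r[0] for r in batch1 + batch2]
print(sorted(set(iids)))     # only iids 0..9: 10 duplicates, 10 items missing
```

Both windows were cut from differently ordered tie groups, so the concatenation repeats iids 0 through 9 and never returns 10 through 19, which is the same 10-duplicate pattern observed with the 58-item list.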
If I call db_get_items_sorted(...) with offset 0 and limit 58, there are no iid duplicates. However, I need to call it in small batches, since going forward we may sort several thousand items this way.
The question is: how do I achieve iid uniqueness across batches?
I didn't want to use DISTINCT to paper over the problem; I wasn't sure why the rows were being duplicated in the first place, and I wanted to address that root cause.
Below is the fix: add iid, a column with unique values, to the ORDER BY as a tiebreaker:
sql_stmt = ("SELECT a.iid, a.cid, b.pplt "
            "FROM pclg.item_cat_id_tbl AS a "
            "INNER JOIN pclg.item_popularity_tbl AS b ON b.iid=a.iid "
            "WHERE a.iid IN (%s) "
            "ORDER BY a.cid ASC, "
            "b.pplt DESC, a.iid DESC "
            "LIMIT %s,%s") % (n1, offset, limit)
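While fixing the ORDER BY, it is also worth letting the driver bind the values instead of interpolating them with Python string formatting, so the connector handles escaping. A sketch of that approach (build_sorted_items_query is a hypothetical helper; the commented-out cursor usage assumes a DB-API driver such as mysql-connector-python):

```python
def build_sorted_items_query(iid_list, offset, limit):
    """Build the query with one driver placeholder per value.

    Returns (sql, params); the driver escapes every value itself,
    instead of us formatting them into the SQL string.
    """
    placeholders = ','.join(['%s'] * len(iid_list))
    sql = ("SELECT a.iid, a.cid, b.pplt "
           "FROM pclg.item_cat_id_tbl AS a "
           "INNER JOIN pclg.item_popularity_tbl AS b ON b.iid = a.iid "
           "WHERE a.iid IN (" + placeholders + ") "
           "ORDER BY a.cid ASC, b.pplt DESC, a.iid DESC "
           "LIMIT %s, %s")
    return sql, tuple(iid_list) + (offset, limit)

# Usage with a DB-API connection (hypothetical `conn`):
# sql, params = build_sorted_items_query(iid_list, offset, limit)
# cur = conn.cursor()
# cur.execute(sql, params)
# rows = cur.fetchall()
```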
TL;DR: The duplication was happening because we were using LIMIT together with ORDER BY, and the columns in the ORDER BY did not have unique values.
Reasons:
LIMIT stops as soon as MySQL has found the required number of rows satisfying the query.
When ORDER BY uses columns with non-unique values, rows sharing the same values in those columns have no deterministic relative order.
So, calling this routine multiple times with LIMIT (and concatenating the results into the final list), MySQL ran the ORDER BY afresh on each call, and rows with equal values in the ORDER BY columns could come back in a different relative order each time; their positions were not constant across calls. LIMIT M,N, however, always cuts the (M, N) window out of whatever ordering that particular execution produced.
As a result, the concatenated result held duplicate rows (and, correspondingly, was missing other rows entirely).
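With a unique column appended to the ORDER BY, the sort order becomes total, so every LIMIT window is cut from the same fixed ordering no matter how the server orders ties internally. A minimal sketch with toy data (not the real tables) to illustrate:

```python
# Toy rows: (iid, cid, pplt). All 20 rows tie on (cid, pplt).
rows = [(iid, 1, 5) for iid in range(20)]

def fetch(tie_order, offset, limit):
    # tie_order models the server's arbitrary internal ordering of tied
    # rows; the trailing unique key (-r[0], i.e. iid DESC) makes it moot.
    pre_shuffled = sorted(rows, key=lambda r: tie_order * r[0])
    ordered = sorted(pre_shuffled, key=lambda r: (r[1], -r[2], -r[0]))
    return ordered[offset:offset + limit]

batch1 = fetch(+1, 0, 10)    # different internal tie orders per call...
batch2 = fetch(-1, 10, 10)
iids = [r[0] for r in batch1 + batch2]
print(len(set(iids)))        # ...but 20 unique iids: no cross-batch duplicates
```

Because the final sort key (cid, pplt DESC, iid DESC) is unique per row, both windows are disjoint slices of one and the same total ordering.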
Ref: https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html