firebase google-cloud-platform google-cloud-firestore ranking

Multi-parameter ranking using Firebase

How can I leverage Firebase to develop a mixer algorithm similar to Twitter, which retrieves and ranks discussions from Firestore based on weight and created_at parameters?

I have a discussion collection with the following structure:

interface Discussion {
    weight: number;
    created_at: Timestamp; // Firestore Timestamp
}

Challenge:

In Firestore, ordering data by a single field poses a limitation. For example, if we order discussions solely by weight, new posts will never have the opportunity to rise up in the ranking.

If I attempt to order discussions separately by weight and created_at, how can I handle deduplication effectively?

Note that the number of discussion documents can range from 0 to 1 million, so I prefer a solution that avoids loading all the documents on the client side. Additionally, the results must be reactive and use the onSnapshot method for real-time updates.

Example Scenario:


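import { getFirestore, collection, query, orderBy, onSnapshot, Timestamp } from "firebase/firestore";

interface Discussion {
    weight: number;
    created_at: Timestamp;
}

// Assumes the default Firebase app is already initialized and that
// setState is the state setter of the surrounding component.
async function queryDiscussionFromFireStore() {
    const col_ref = collection(getFirestore(), "discussion");

    // Query top discussions (highest weight first)
    const topPost_unSub = onSnapshot(
        query(col_ref, orderBy("weight", "desc")),
        (snapshot) => {
            setState(snapshot.docs.map((d) => d.data() as Discussion));
        }
    );

    // Query recent discussions (newest first)
    const recentPost_unSub = onSnapshot(
        query(col_ref, orderBy("created_at", "desc")),
        (snapshot) => {
            setState(snapshot.docs.map((d) => d.data() as Discussion));
        }
    );

    // Cleanup function that detaches both listeners
    return () => {
        recentPost_unSub();
        topPost_unSub();
    };
}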

queryDiscussionFromFireStore works fine, but I can't figure out how to handle duplicate data.

Let's suppose we have the following data:

[
    {
        weight: 5,
        created_at: today_date
    },
    {
        weight: 3,
        created_at: today_date
    },
]

In this case, both snapshots will return the same data.

Explanation

In the provided code example, the queryDiscussionFromFireStore function retrieves discussions from Firestore with two separate queries: one ordered by weight and one ordered by created_at. Each query uses the onSnapshot method to listen for real-time updates.

However, there is a concern regarding duplicate data. Because both queries read from the same collection, the same discussion can appear in both the "top discussions" result (ordered by weight) and the "recent discussions" result (ordered by creation time). Without a limit on either query, every discussion does.

For instance, considering the following example data:

[
    {
        weight: 5,
        created_at: today_date
    },
    {
        weight: 3,
        created_at: today_date
    },
]

In this case, both onSnapshot callbacks for the "top discussions" and "recent discussions" queries will receive the same data, which results in duplicate entries being processed.

Solution

From the Firestore documentation on its query limitations:

In a compound query, range (<, <=, >, >=) and not equals (!=, not-in) comparisons must all filter on the same field.

So each query can only have range filters on a single field, and there is no way to order/filter top results on multiple fields in a single query. You will have to perform multiple queries and deduplicate the results in your application code.
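The deduplication step can be sketched roughly as follows. This is only a minimal illustration: the RankedDiscussion shape, the createdAtMs field (created_at converted to epoch milliseconds), and the mergeDiscussions helper are hypothetical names, and it assumes each item carries its Firestore document ID so that the same document arriving from both listeners can be recognized.

```typescript
// Hypothetical shape: the Firestore document ID plus the fields from the question.
interface RankedDiscussion {
  id: string;          // document ID, assumed to be copied from the snapshot
  weight: number;
  createdAtMs: number; // created_at converted to epoch milliseconds (assumption)
}

// Merge the results of the "top" and "recent" queries, keeping one entry per ID.
function mergeDiscussions(
  top: RankedDiscussion[],
  recent: RankedDiscussion[],
): RankedDiscussion[] {
  const byId = new Map<string, RankedDiscussion>();
  for (const d of top.concat(recent)) {
    byId.set(d.id, d); // a later entry with the same ID overwrites the earlier one
  }
  // Illustrative ordering only: weight first, newest first as a tie-breaker.
  return Array.from(byId.values()).sort(
    (a, b) => b.weight - a.weight || b.createdAtMs - a.createdAtMs,
  );
}
```

Each onSnapshot callback would store its latest result separately, and the state would then be set to the merged list, so a document returned by both queries is rendered once. It is still read twice, though.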

That also means that there is no way to prevent the extra reads. Theoretically, you could find a way to merge created_at and weight into a single value/property that you can filter on to meet your requirements, but the only real example of that kind that I know of is geohashes (which combine the lat/lon values of a point into a single string value that you can filter on to find documents in a region), and I personally don't see an equivalent here.
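For illustration only, here is one hypothetical way such a merged value could look: a time-weighted score in the style of the "hot" rankings used by some link aggregators. The hotScore function, the DECAY_SECONDS constant, and the idea of storing the result in a separate field are all assumptions, not something Firestore provides, and whether the resulting order matches your requirements depends entirely on how you tune it.

```typescript
// Hypothetical tuning constant: how many seconds of age a 10x higher weight offsets.
const DECAY_SECONDS = 45000;

// Combine weight and creation time into one sortable number.
// Newer posts and heavier posts both score higher.
function hotScore(weight: number, createdAtSeconds: number): number {
  // Logarithm so that the first few points of weight matter more than later ones;
  // weights below 1 are clamped to 1 to avoid negative or infinite values.
  const magnitude = Math.log10(Math.max(weight, 1));
  return magnitude + createdAtSeconds / DECAY_SECONDS;
}
```

Because created_at never changes, a score like this would only need to be rewritten when weight changes. Under that assumption, a single query ordered by the stored score (with orderBy and limit) could replace the two listeners, at the cost of an extra write on every weight update.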