Search code examples
node.jsgoogle-cloud-firestorenosqlbackendquery-optimization

Efficiently Querying Firestore for Newsfeed with Large Followed User Lists to Reduce Costs


I’m working on a Firestore-based social app where users can see a newsfeed of posts. The feed includes:

  1. Public posts (sourcePrivacyStatus === PUBLIC).
  2. Newsfeed type posts (newsfeedType === NEWS_FEED).
  3. Followers-only posts (sourcePrivacyStatus === FOLLOWERS_ONLY) if the creator (added_by) is in the user's followed list.

Problem To query for followers-only posts, I use the in operator to match added_by against the list of followed user IDs (followedIds). However, Firestore has a limitation where the in operator can handle a maximum of 10 values.

In cases where a user follows hundreds or thousands of users, this requires splitting followedIds into chunks of 10 and running multiple queries. For example:

For 1000 followed IDs, I would need 100 queries. This significantly increases the read costs and query execution time.

Current Approach Here’s how I currently handle the query:

const chunkArray = (array: any[], chunkSize: number) => {
  const chunks = [];
  for (let i = 0; i < array.length; i += chunkSize) {
    chunks.push(array.slice(i, i + chunkSize));
  }
  return chunks;
};

export const retrieveFollowedNewsfeedsQuery = async (
  followedIds: string[],
  lastTimeStamp: FirebaseFirestore.Timestamp | null = null,
  limit: number = 20
) => {
  const idChunks = chunkArray(followedIds, 10); // Split IDs into chunks of 10
  const queries = idChunks.map((ids) => {
    let query = db
      .collection("newsfeed")
      .where(
        Filter.or(
          Filter.where("sourcePrivacyStatus", "==", CoffeeStatus.PUBLIC),
          Filter.where("newsfeedType", "==", NewsfeedTypes.NEWS_FEED),
          Filter.and(
            Filter.where("sourcePrivacyStatus", "==", CoffeeStatus.FOLLOWERS_ONLY),
            Filter.where("added_by", "in", ids)
          )
        )
      )
      .orderBy("timeStamp", "desc");

    if (lastTimeStamp) {
      query = query.startAfter(lastTimeStamp);
    }

    return query;
  });

  const results = [];
  for (const query of queries) {
    const snapshot = await query.get();
    snapshot.forEach((doc) => {
      results.push({ id: doc.id, ...doc.data() });
    });
  }

  results.sort((a, b) => b.timeStamp.toMillis() - a.timeStamp.toMillis());
  return results.slice(0, limit); // Return only the required number of results
};

Challenges with This Approach

  1. High Cost: Running 100+ queries for large follower lists leads to high read costs.
  2. Inefficiency: Sorting and merging results from multiple queries increases latency.
  3. Rate Limits: Firestore has rate limits that might be exceeded with many concurrent queries.

I’m looking for a cost-effective and efficient way to implement this functionality.


Solution

  • Not a direct answer but a different approach.

    We had a similar issue with complex queries in Firestore. So we ended up synching related collections to an ElasticSearch service, which is absolutely marvelous at querying, sorting and aggregating results.

    We use ElasticSearch to query for:

    • Dashboards, with a lot of aggregated data.
    • Lists, where the user can select different filters and sorting. Dynamically selecting properties for a filtered list requires Firestore to create all possible combinations, which is not only impractical but a heavy resource consumer.

    We use Firestore for all the rest, specially those "detail pages" of a single entity where we can subscribe to the specific single documents needed, and get real-time updates in front-end. Users love that.

    To ease and secure the ElasticSearch integration from the client, we set up an onCall Firebase Function which acts like a proxy to our ElasticSearch service.

    To sync the documents to ElasticSearch, we set up an onDocumentWritten() trigger on the needed collections, making sure to use versioning when synching.

    I guess you can do the same with BigQuery. It claims to also work with unstructured data, but I haven't tried it, though.

    Regarding:

    Firestore has a limitation where the in operator can handle a maximum of 10 values

    in operator limit has been raised to 30. Not that it helps your problem but for you to know, it can save you some queries in the meantime.

    In case you decide to stick to Firestore in operator, you could limit to the 30 most relevant followed users. For that, you'd need to update the following-user (A) each time the followed-user (B) does something relevant, whatever "relevant" means for your service.

    For example, each time user B writes a post, update a timestamp in user B's entry in user A's followed list. Then user A can filter their followed list by recent activity, and use only those 30 with the in operator.