
Microsoft Graph API Fetching Large Quantity of Small Documents From OneDrive Returns Duplicates


I'm working on an app that uses the Microsoft Graph API to fetch the contents of OneDrive based on supported search criteria (modified date range, keyword, author, etc.). When we started testing, we ran into a crippling issue: we are not getting all of the expected documents back from the Microsoft Graph API, because some of the returned documents are duplicates. In our case, we expected 10k unique documents to be returned. We got 10k documents back, but 1,098 of them were duplicates, so we are missing 1,098 unique documents.

The frustrating part is that the Microsoft Graph API LIES to you in the response. It reports the correct expected "total" document count, but when we analyzed what we actually got back, about 11% were duplicate documents.

We initially thought that the issue was with the Microsoft Graph SDK we were using for fetching in our app, but then we used the online Microsoft Graph Explorer and analyzed the JSON responses to verify, and the same issue occurs.

What we tried:

  • We created a folder containing 10k unique text documents with a modified date of April 15, 2024.

  • All the files are unique 1 KB to 5 KB text files, with no duplicates in the original set.

  • Then we uploaded those files to our test OneDrive account using the OneDrive Windows client. Because of the quantity of files, this took a few days.

  • We then waited over a week after the files were uploaded to give them enough time to be synced to the Microsoft Graph API service.

  • We then went to the online Microsoft Graph Explorer, logged in as our test user, and selected "search driveitem" from the sample queries.

  • Selecting the "search driveitem" query makes a POST request to "https://graph.microsoft.com/v1.0/search/query"

We created the following query:

{
    "requests": [
        {
            "entityTypes": [
                "driveItem"
            ],
            "query": {
                "queryString": "lastModifiedTime>=2024-04-10 AND lastModifiedTime<=2024-04-20"
            },
            "from": 0,
            "size": 100,
            "fields": [
                "id"
            ]
        }
    ]
}

The query above searches for all OneDrive items with a modified date between April 10, 2024 and April 20, 2024. The request is paginated at 100 documents per request, so we get the first 100 documents on the first request, then we manually increase the "from" property in increments of 100, so it goes 0 -> 100 -> 200 ... etc. We are limiting the returned fields to only "id" because we are only interested in analyzing document ids to find duplicates. A sample response looks like this:

{
    "value": [
        {
            "searchTerms": [],
            "hitsContainers": [
                {
                    "hits": [
                        {
                            "hitId": "01S4RUMYPHXG4DV57DVVBKNFDS45C5OXHY",
                            "rank": 1,
                            "summary": "Category:Bombay Bicycle Club albums 14 27915462 593764600 517040894 2014-02-03T18:39:40Z Starcheerspeaksnewslostwars 11554556 removed [[Category:Indie rock albums by British<ddd/>",
                            "resource": {
                                "@odata.type": "#microsoft.graph.driveItem",
                                "listItem": {
                                    "@odata.type": "#microsoft.graph.listItem",
                                    "id": "3ab8b9e7-e3f7-42ad-a694-72e745d75cf8",
                                    "fields": {
                                        "id": "AAAAADhW4K7ZoZ9IslsSHKvPpasHAF4fqM0LdMROqdU8vNuOKqoAAE0mEeYAAO0jEMpdpCFHp_lSL-72PpUAAPdT2ZwAAA2"
                                    }
                                },
                                "id": "01S4RUMYPHXG4DV57DVVBKNFDS45C5OXHY"
                            }
                        },
// ...etc until the 100th file
                    ],
                    "total": 10000,
                    "moreResultsAvailable": false
                }
            ]
        }
    ],
    "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#Collection(microsoft.graph.searchResponse)"
}

Note: it correctly reports 10,000 files in the "total" property of the response. So the Graph API knows for a fact there are supposed to be 10k unique files!!!

To analyze what we got, we manually (yes manually :'( !) copied the contents of the value[0].hitsContainers[0].hits array from each response and pasted them into one big hits array in our IDE. After each request we incremented the "from" property by 100 to get the next 100 files, until we had all 10k files. This was exhausting, but we had to verify that the SDK wasn't the issue.

Then, to check for duplicates, we ran the following JS code to iterate over the hits array containing 10k entries:
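For anyone who wants to reproduce this without copying by hand, the paging and collection steps described above could be scripted roughly as follows. This is only a sketch: `buildSearchBody`, `pageOffsets`, and `collectHits` are hypothetical helper names, the responses are assumed to have the shape of the sample response shown earlier, and the actual POST to /search/query (with authentication) is left out.

```javascript
// Query string copied from the request shown earlier in the question.
const QUERY = "lastModifiedTime>=2024-04-10 AND lastModifiedTime<=2024-04-20";

// Hypothetical helper: request body for the page that starts at `from`.
function buildSearchBody(from, size = 100) {
  return {
    requests: [
      {
        entityTypes: ["driveItem"],
        query: { queryString: QUERY },
        from,
        size,
        fields: ["id"],
      },
    ],
  };
}

// Page offsets 0, size, 2 * size, ... that stay below `total`.
function pageOffsets(total, size = 100) {
  const offsets = [];
  for (let from = 0; from < total; from += size) {
    offsets.push(from);
  }
  return offsets;
}

// Flattens the hits of several parsed responses into one array.
// Assumes each response looks like the sample: value[].hitsContainers[].hits[].
function collectHits(responses) {
  return responses.flatMap((response) =>
    response.value.flatMap((entry) =>
      entry.hitsContainers.flatMap((container) => container.hits ?? [])
    )
  );
}

// 10,000 expected results at 100 per page means 100 request bodies.
const bodies = pageOffsets(10000).map((from) => buildSearchBody(from));
console.log(bodies.length, bodies[1].requests[0].from); // 100 100
```

Each body would be sent as one POST request, and the parsed responses passed to `collectHits` to get the combined hits array used in the duplicate check.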

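// Placeholder for the combined array of 10k hits copied from the Microsoft Graph API responses.
const hits = [ /* ... really large array containing 10k hits from microsoft graph api response */ ];

// Every hitId seen so far.
const unique = new Set();

const hitIds = hits.map(element => element.hitId);

// Keep each hitId that was already seen; record first occurrences in the set.
const duplicates = hitIds.filter(hitId => {
    if (unique.has(hitId)) {
        return true;
    }
    unique.add(hitId);
    return false;
});

console.log(unique.size, duplicates.length); // Output with our data: 8902 1098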

With the above code we verified that the duplicates come from the Microsoft Graph API itself and not from the Microsoft Graph SDK. This is frustrating because 11% of files being duplicates is significant, and the user could lose a lot of files during a fetch/transfer.

We tried setting the "trimDuplicates" property on the request and got the same result. We then thought we were fetching too many files at once, so we changed the pagination from increments of 100 to increments of 50, and then 25, and the same result came back. We did this manually just to verify that the issue was with the Microsoft Graph API. We really don't know where to go from here.
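To see where the repeats fall, a variation of the check above could record which page each hitId appeared on. This is a sketch only: `pagesByHitId` and `findRepeated` are hypothetical names, and it assumes the hits array is in the same order the pages were fetched.

```javascript
// Maps each hitId to the 0-based page numbers it appeared on.
function pagesByHitId(hits, pageSize = 100) {
  const pages = new Map();
  hits.forEach((hit, index) => {
    const page = Math.floor(index / pageSize);
    if (!pages.has(hit.hitId)) {
      pages.set(hit.hitId, []);
    }
    pages.get(hit.hitId).push(page);
  });
  return pages;
}

// Returns [hitId, pages] pairs for every hitId that appears more than once.
function findRepeated(hits, pageSize = 100) {
  return [...pagesByHitId(hits, pageSize)].filter(
    ([, pages]) => pages.length > 1
  );
}

// Made-up sample with a page size of 2: "a" shows up on pages 0 and 1.
const sample = [{ hitId: "a" }, { hitId: "b" }, { hitId: "a" }, { hitId: "c" }];
for (const [hitId, pages] of findRepeated(sample, 2)) {
  console.log(hitId, pages.join(","));
}
```

Knowing whether a repeated id shows up on neighbouring pages or far apart may help when reporting the problem.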

We have not been able to find much regarding this topic and are hoping we can get help here. Is there anything else we can do to mitigate or fix this issue? We are open to any suggestions from users who know how to use the Microsoft Graph API.


Solution

  • When querying a large number of search results, it's recommended to use indexDocId instead of the from property for efficient pagination of large result sets.

    • First, sort the results by [DocId] in ascending order.
    • Retrieve the indexDocId value of the last entry in the result
    • For subsequent pages, use an increasing indexDocId restriction in your query (e.g., indexDocId>10, where 10 is the indexDocId of the last value from the previous page).
    • Repeat this pattern for each page.

    More details can be found here:

    https://www.techmikael.com/2023/08/how-to-paginate-large-results-sets-for.html

    https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets
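
    The steps above could be sketched like this for the request in the question. Treat it as an illustration under assumptions, not a verified implementation: `buildDocIdPageBody` and `lastIndexDocId` are hypothetical helper names, the use of `sortProperties` with "[DocId]" follows the approach in the linked article, and the location and casing of the returned indexDocId field (assumed here to sit under resource.listItem.fields, like "id" in the sample response) should be checked against a real response.

```javascript
// Builds the request body for the page that follows `lastId`.
// Pass "0" for the first page. No "from" property is used.
function buildDocIdPageBody(baseQuery, lastId, size = 100) {
  return {
    requests: [
      {
        entityTypes: ["driveItem"],
        query: {
          // Restrict to items after the last one seen on the previous page.
          queryString: `(${baseQuery}) AND indexDocId>${lastId}`,
        },
        // Ascending [DocId] order so each page continues where the last ended.
        sortProperties: [{ name: "[DocId]", isDescending: false }],
        size,
        fields: ["id", "indexDocId"],
      },
    ],
  };
}

// Reads the indexDocId of the last hit on a page, or null for an empty page.
// ASSUMPTION: the value is returned under resource.listItem.fields.
// It is kept as a string so a large id does not lose precision.
function lastIndexDocId(hits) {
  if (hits.length === 0) {
    return null;
  }
  const fields = hits[hits.length - 1].resource?.listItem?.fields ?? {};
  const value = fields.indexDocId ?? fields.IndexDocId;
  return value === undefined ? null : String(value);
}

// Made-up page of two hits, followed by the body for the next page.
const page = [
  { resource: { listItem: { fields: { indexDocId: "7" } } } },
  { resource: { listItem: { fields: { indexDocId: "10" } } } },
];
const next = buildDocIdPageBody(
  "lastModifiedTime>=2024-04-10 AND lastModifiedTime<=2024-04-20",
  lastIndexDocId(page)
);
console.log(next.requests[0].query.queryString);
```

    The loop would stop when a page comes back empty, that is, when `lastIndexDocId` returns null.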