I am attempting to pull ALL data out of the ServiceNow CMDB tables into my own SQL database. I plan to use this for automated updates when data is maintained in a customer's legacy system that needs to be synced with the new ServiceNow implementation.
Our ServiceNow environment has many domains, many customers, etc., which translates to a huge amount of data. Attempting to pull ALL records in a single API call naturally results in a timeout exception because the response is too large. The Web API documentation suggests batching the data, which is what I have implemented:
...
var offset = 0;
while (true)
{
    var requestUrl = string.Format(
        "{0}/api/now/table/cmdb_ci_vm_instance?sysparm_offset={1}&sysparm_limit={2}&sysparm_exclude_reference_link=true",
        url, offset, batchSize);
    var result = ApiRequest.Get(new Uri(requestUrl), _credential, null, out response);
    if (result != HttpStatusCode.OK)
    {
        break;
    }

    var virtualmachineRo = JsonConvert.DeserializeObject<VirtualMachineRootObject>(response);
    virtualmachines.AddRange(virtualmachineRo.result);

    // Fewer records than requested => assume this was the last page.
    if (virtualmachineRo.result.Count < batchSize)
    {
        break;
    }

    offset += virtualmachineRo.result.Count;
}
...
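For reference, the deserialization above assumes a pair of DTO classes shaped like the Table API's {"result":[...]} JSON envelope. A minimal sketch (the field list is illustrative, not exhaustive):

using System.Collections.Generic;

public class VirtualMachine
{
    public string sys_id { get; set; }
    public string name { get; set; }
    // ...whatever other cmdb_ci_vm_instance fields you map
}

public class VirtualMachineRootObject
{
    // The Table API wraps the returned records in a "result" array.
    public List<VirtualMachine> result { get; set; }
}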
I have experimented with different batch sizes and get different results. Depending on which table I am hitting, sometimes I get duplicates, sometimes I don't, and sometimes I get different duplicates. In almost all cases, when I DO get a duplicate, it takes the place of a valid record in the stream, which means I am 'missing' that record.
I have confirmed this by looking at the raw data stream coming from ServiceNow and seeing the duplicates, and by verifying that there are records in ServiceNow that did not show up in my stream. (The latter is typically discovered when I try to create a missing record and ServiceNow responds with "ignored - no field changed", indicating that the record is already there.)
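For anyone trying to reproduce this, a quick way to surface the duplicates client-side is a LINQ grouping over the retrieved list (a minimal sketch, assuming the deserialized records expose sys_id as above):

// Group the retrieved records by sys_id; any group with more than one
// entry is a duplicate delivered across the paged API calls.
var duplicates = virtualmachines
    .GroupBy(x => x.sys_id)
    .Where(g => g.Count() > 1)
    .Select(g => new { SysId = g.Key, Count = g.Count() })
    .ToList();

foreach (var d in duplicates)
    Console.WriteLine("Duplicate sys_id {0} seen {1} times", d.SysId, d.Count);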
I have searched for anyone with similar issues but have not found anything. Before opening a ticket with ServiceNow directly, I was hoping to find out whether I am doing something incorrectly in my code, or whether this is what I suspect: a platform-related bug.
So this is almost certainly a bug in their API related to service account access. Here is what I have done, what I believe is occurring, and my workaround!
After playing around with some different logic, I encountered the following:
{0}/api/now/table/cmdb_ci_vm_instance?sysparm_offset=0&sysparm_limit=750&sysparm_exclude_reference_link=true&sysparm_query=ORDERBYsys_id
2016-08-19 12:46:15,277 [INFO ] Returned Records Count: 750
{0}/api/now/table/cmdb_ci_vm_instance?sysparm_offset=750&sysparm_limit=750&sysparm_exclude_reference_link=true&sysparm_query=ORDERBYsys_id
2016-08-19 12:46:28,243 [INFO ] Returned Records Count: 746
Notice that on the second pass I asked for 750 records but only got 746 (a difference of 4). The other batches continued to return 750 as normal, and in the end I wound up with exactly 4 duplicates out of 5200+ total records retrieved. Because the second pass returned fewer than the 750 I asked for, my code was breaking out of the loop early, thinking the data was complete...
To get around this (initially), I changed the termination check in the code above:
// Only stop when a page comes back completely empty.
if (virtualmachineRo.result.Count == 0)
{
    break;
}
I no longer trust that receiving fewer records than sysparm_limit means I have retrieved all records.
It looks like when the data is pulled from the database, a separate process then filters the records based on your permissions, so the second batch has 4 records removed after the fact; the database doesn't know any better. When I then incremented my offset by the actual count returned from the 746-record call, the query window shifted by only 746 rows, and the last 4 records of that batch were served again as the first 4 records of the next page. (Confirmed in the data stream: ordering the data makes the duplicates appear in different places, and different records are duplicated depending on batch size and sort order as well!)
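To make the arithmetic concrete (the row numbers are illustrative):

Request:  offset=750,  limit=750   -> rows 750..1499 of the unfiltered result set
ACL filtering drops 4 rows         -> 746 records actually returned
Next offset = 750 + 746 = 1496     -> rows 1496..2245 requested
Rows 1496..1499 are read twice     -> the 4 duplicates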
I can't believe this is the intended behavior, and it is certainly not documented in the Wiki. Either way, the workaround is simple for anyone running into this: keep pulling data even when the count returned is less than the batch size, and only stop once 0 records are returned. Then filter out the duplicates with something simple like:
yourrecords.GroupBy(x => x.sys_id).Select(x => x.First()).ToList();
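Putting both changes together, the paging loop ends up looking something like this (a minimal sketch based on the code above; the dedup step assumes sys_id uniquely identifies a record):

var virtualmachines = new List<VirtualMachine>();
var offset = 0;
while (true)
{
    var requestUrl = string.Format(
        "{0}/api/now/table/cmdb_ci_vm_instance?sysparm_offset={1}&sysparm_limit={2}&sysparm_exclude_reference_link=true",
        url, offset, batchSize);
    var result = ApiRequest.Get(new Uri(requestUrl), _credential, null, out response);
    if (result != HttpStatusCode.OK)
    {
        break;
    }

    var virtualmachineRo = JsonConvert.DeserializeObject<VirtualMachineRootObject>(response);

    // Keep paging until a batch comes back completely empty; a short batch
    // only means ACLs filtered some rows out of this window.
    if (virtualmachineRo.result.Count == 0)
    {
        break;
    }

    virtualmachines.AddRange(virtualmachineRo.result);
    offset += virtualmachineRo.result.Count;
}

// Collapse the duplicates introduced by the overlapping windows.
virtualmachines = virtualmachines
    .GroupBy(x => x.sys_id)
    .Select(g => g.First())
    .ToList();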
And the problem is 'solved.'