I am trying to construct a custom metric for Google Stackdriver that I can use to track Node.js event loop latencies. All the apps run in Google App Engine, so I am confined to using the monitored resource global (at least to my understanding).
Via the Node.js @google-cloud/monitoring client I have created a metric descriptor that looks like this:
{
  name: client.projectPath(projectId),
  metricDescriptor: {
    description: 'Nodejs event loop latency',
    displayName: 'Event Loop Latency',
    type: 'custom.googleapis.com/nodejs/eventloop/latency',
    metricKind: 'GAUGE',
    valueType: 'DOUBLE',
    unit: '{ms}',
    labels: [
      {
        key: 'instance_id',
        valueType: 'STRING',
        description: 'The ID of the instance reporting latency (containerId, vmId, etc.)',
      },
    ],
  },
}
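For completeness, I register that descriptor roughly like this (just a sketch; request stands for the { name, metricDescriptor } object shown above, and registerDescriptor is only a wrapper for illustration):

const monitoring = require('@google-cloud/monitoring');
const client = new monitoring.MetricServiceClient();

async function registerDescriptor(request) {
  // `request` is the { name, metricDescriptor } object shown above.
  const [descriptor] = await client.createMetricDescriptor(request);
  console.log('Created custom metric:', descriptor.type);
}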
And I am writing data to this custom metric like this:
{
  metric: {
    type: 'custom.googleapis.com/nodejs/eventloop/latency',
    labels: {
      instance_id: instanceId,
    },
  },
  resource: {
    type: 'global',
    labels: {
      project_id: projectId,
    },
  },
  points: [{
    interval: {
      endTime: {
        seconds: item.at,
      },
    },
    value: {
      doubleValue: item.value,
    },
  }],
}
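The write itself is issued roughly like this (again a sketch; timeSeriesData is just a placeholder name for the object above, and client is the same MetricServiceClient as before):

async function writePoint(timeSeriesData) {
  // createTimeSeries takes the project name plus an array of time series objects.
  await client.createTimeSeries({
    name: client.projectPath(projectId),
    timeSeries: [timeSeriesData],
  });
}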
I thought all was good while writing my tests, until I changed my instance_id to write data in a timespan that overlapped with what another fake instance had already written. Now the monitoring client throws the error:
Error: One or more TimeSeries could not be written:
Points must be written in order. One or more of the points specified was older than the most recent stored point.
Which renders my custom metric VERY useless: only one Node.js process can ever write to it.
Now my question is: how can I circumvent this? I want to be able to write from all of my running Node.js instances (x App Engine services with y instances each).
I was thinking of a type indexed on nodejs/eventloop/latency/{serviceName}/{serviceVersion}/{instanceId}, but that seems a bit extreme and would quickly bring me towards the quotas on the Stackdriver account.
Any suggestions are highly appreciated!
Time series data for custom metrics in Stackdriver must be written in time order, as documented in https://cloud.google.com/monitoring/custom-metrics/creating-metrics#which-resource.
A workaround for this is to create a separate time series for every instance writing to the metric by adding a user-defined label for the instance_id. You can also add separate labels for service_name or service_version if you require them. However, be mindful of the cardinality of the label values: creating too many time series on a single metric can degrade query performance.
For more details on what a time series is, see https://cloud.google.com/monitoring/api/v3/metrics-details#intro-time-series.