kubernetes, versioning, istio, semantic-versioning, servicemesh

How to deal with breaking changes in a Service Mesh


I'm building an example microservice application with Kubernetes to find out the best practices and some patterns for future projects. I'm using Istio as a Service Mesh to handle east-west traffic and I have a basic understanding of the concepts (VirtualServices, DestinationRules, ...). The service mesh allows me to easily push out new versions of a microservice and redirect the traffic to the new instance (using e.g. weighted distribution). With semantic versioning in mind, this works really well for patch and minor updates, because they, in theory, don't alter the existing contract and can therefore be a drop-in replacement for the existing service. Now I'm wondering how to properly deal with breaking changes to a service, i.e. a major version update.

It's hard to find information for this, but with the limited info I got, I'm now thinking about two approaches:

  1. Each major version of a service (e.g. user-service) gets its own VirtualService so that clients can address it correctly (by a different service name, e.g. user-service-v1). Istio is then used to correctly route the traffic for a major version (e.g. 1.*) to the different available services (e.g. user-service v1.3.1 and user-service v1.4.0).

  2. I use one overall VirtualService for a specific microservice (so e.g. user-service). This VirtualService contains many routing definitions to use e.g. a header sent by the client (e.g. x-major-version=1) to match the request to a destination.

Overall there is not too much difference between the two methods. The client obviously needs to specify which major version it wants to talk to, either by setting a header or by resolving a different service name. Are there any limitations of the described methods that make one superior to the other? Or are there other options I'm totally missing? Any help and pointers are greatly appreciated!


Solution

  • TLDR

    Besides what I mentioned in the comments, after a more detailed look at the topic I would choose approach 2: one overall VirtualService per microservice, combined with canary deployment and mirroring.

    Approach 1

    As mentioned in documentation

    In situations where it is inconvenient to define the complete set of route rules or policies for a particular host in a single VirtualService or DestinationRule resource, it may be preferable to incrementally specify the configuration for the host in multiple resources. Pilot will merge such destination rules and merge such virtual services if they are bound to a gateway.

    So in theory you could go with approach number 1, but I would say that it requires too much configuration and there is a better way to do it.

    Let's say you have the old app deployed as v1.3.1 and the new app as v1.4.0; the corresponding VirtualServices would look as follows.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-service1
    spec:
      hosts:
      - '*'
      http:
      - name: "v1.3.1"
        route:
        - destination:
            host: service1.namespace.svc.cluster.local   # Service backing the old (v1.3.1) deployment

    ---

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-service2
    spec:
      hosts:
      - '*'
      http:
      - name: "v1.4.0"
        route:
        - destination:
            host: service2.namespace.svc.cluster.local   # Service backing the new (v1.4.0) deployment
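
    If you want clients to address each major version by a different service name, as described in the question (e.g. user-service-v1), a sketch of that variant would give each VirtualService its own host instead of '*'. The names below are assumptions, and a Kubernetes Service (or ServiceEntry) with that host would have to exist for the sidecars to route it:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-user-service-v1
    spec:
      hosts:
      - user-service-v1.namespace.svc.cluster.local   # assumed name clients use for major version 1
      http:
      - route:
        - destination:
            host: service1.namespace.svc.cluster.local   # backend currently serving the 1.x line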
    

    Approach 2

    In practice I would go with approach number 2. For example, you can create two versions of your app (in the example below, old and new) and then configure a VirtualService and a DestinationRule for them.

    The question here would be: why? Because it's easier to manage, at least for me, and because it makes canary deployment and mirroring easy to use; more about that below.

    Let's say you have deployed the new app and want to send 1% of the incoming traffic to it. Additionally, you can use mirroring, so that every request which goes to the old service is also mirrored to the new service for testing purposes.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-service
    spec:
      hosts:
      - '*'
      http:
      - name: "canary"
        route:
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v1            # old version
          weight: 99              # 99% of live traffic stays on the old version
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v2            # new version
          weight: 1               # 1% of live traffic goes to the new version
        mirror:
          host: service.namespace.svc.cluster.local
          subset: v2              # every request handled by this route is also copied to v2
        mirror_percent: 100       # mirror 100% of the traffic that hits this route
    
    ---
    
    
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: dr-service
    spec:
      host: service.namespace.svc.cluster.local
      subsets:
      - name: v1
        labels:
          version: v1   # label on the old pods
      - name: v2
        labels:
          version: v2   # label on the new pods
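
    For the subsets to select anything, those version labels have to be present on the pods, i.e. on the Deployments' pod templates. Here is a minimal sketch of the old version's Deployment; the names, namespace and image tag are assumptions. The new version's Deployment would look the same, with version: v2 and the new image.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: service-v1
      namespace: namespace        # assumed namespace, matching the host names above
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: service
          version: v1
      template:
        metadata:
          labels:
            app: service
            version: v1           # this is the label the v1 subset selects
        spec:
          containers:
          - name: service
            image: example/service:1.3.1   # assumed image tag for the old version
            ports:
            - containerPort: 8080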
    

    Testing new application

    The client obviously needs to specify which major version it wants to talk to, either by setting a header or by resolving a different service name.

    Actually, that depends on the configuration. If you use the option above with an old and a new version, then that is exactly what canary deployment, i.e. weighted distribution, is used for: you specify the percentage of traffic that should be sent to the new version of your app. Of course, you can also specify header or URI matches in your VirtualService so that users can explicitly address the older or the newer version of your app, as sketched below.
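
    For the header-based variant mentioned in the question, a minimal sketch could look like the following. The x-major-version header and its values are assumptions taken from the question; the subsets are the ones defined in the DestinationRule above.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-service-by-header
    spec:
      hosts:
      - '*'
      http:
      - name: "major-v2"
        match:
        - headers:
            x-major-version:
              exact: "2"          # clients explicitly requesting the new major version
        route:
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v2
      - name: "major-v1"          # everything else falls through to the old major version
        route:
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v1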

    Canary Deployment

    As mentioned here

    One of the benefits of the Istio project is that it provides the control needed to deploy canary services. The idea behind canary deployment (or rollout) is to introduce a new version of a service by first testing it using a small percentage of user traffic, and then if all goes well, increase, possibly gradually in increments, the percentage while simultaneously phasing out the old version. If anything goes wrong along the way, we abort and rollback to the previous version. In its simplest form, the traffic sent to the canary version is a randomly selected percentage of requests, but in more sophisticated schemes it can be based on the region, user, or other properties of the request.

    Depending on your level of expertise in this area, you may wonder why Istio’s support for canary deployment is even needed, given that platforms like Kubernetes already provide a way to do version rollout and canary deployment. Problem solved, right? Well, not exactly. Although doing a rollout this way works in simple cases, it’s very limited, especially in large scale cloud environments receiving lots of (and especially varying amounts of) traffic, where autoscaling is needed.


    With Istio, traffic routing and replica deployment are two completely independent functions. The number of pods implementing services are free to scale up and down based on traffic load, completely orthogonal to the control of version traffic routing. This makes managing a canary version in the presence of autoscaling a much simpler problem. Autoscalers may, in fact, respond to load variations resulting from traffic routing changes, but they are nevertheless functioning independently and no differently than when loads change for other reasons.

    Istio’s routing rules also provide other important advantages; you can easily control fine-grained traffic percentages (e.g., route 1% of traffic without requiring 100 pods) and you can control traffic using other criteria (e.g., route traffic for specific users to the canary version). To illustrate, let’s look at deploying the helloworld service and see how simple the problem becomes.

    There is an example.
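
    To illustrate the point about traffic routing and replica scaling being independent, here is a minimal sketch of a HorizontalPodAutoscaler. The Deployment name service-v1 is an assumption; in this setup each version gets its own Deployment and its own autoscaler, and both scale purely on load, regardless of the Istio weights.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: service-v1-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: service-v1          # assumed Deployment running the old version
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80   # scale on CPU usage, independent of the 99/1 traffic split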

    Mirroring

    The second thing that is often used to test a new version of an application is traffic mirroring.

    As mentioned here

    Using Istio, you can use traffic mirroring to duplicate traffic to another service. You can incorporate a traffic mirroring rule as part of a canary deployment pipeline, allowing you to analyze a service's behavior before sending live traffic to it.

    If you're looking for best practices, I would recommend starting with this tutorial on Medium, because it explains the topic very well.

    How Traffic Mirroring Works

    Traffic mirroring works using the steps below:

    • You deploy a new version of the application and switch on traffic mirroring.

    • The old version responds to requests like before but also sends an asynchronous copy to the new version.

    • The new version processes the traffic but does not respond to the user.

    • The operations team monitors the new version and reports any issues to the development team.


    As the application processes live traffic, it helps the team uncover issues that they would typically not find in a pre-production environment. You can use monitoring tools, such as Prometheus and Grafana, for recording and monitoring your test results.

    Additionally there is an example with nginx that perfectly shows how it should work.

    It is worth mentioning that if the service exposes write APIs, like order or payment, then mirrored traffic means those writes are executed multiple times (e.g. the same order being placed twice). This topic is described in detail here by Christian Posta.
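
    One possible way to reduce the duplicate-write problem (a sketch of my own, not taken from the linked article) is to mirror only read traffic, for example by restricting the mirror to routes that match the GET method:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: vs-service-mirror-reads
    spec:
      hosts:
      - '*'
      http:
      - name: "reads"
        match:
        - method:
            exact: GET            # only idempotent reads are mirrored
        route:
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v1
        mirror:
          host: service.namespace.svc.cluster.local
          subset: v2
      - name: "writes"            # writes go only to the old version, no mirroring
        route:
        - destination:
            host: service.namespace.svc.cluster.local
            subset: v1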


    Let me know if there is something more you want to discuss.