
What is the Exact use of requests in Kubernetes


I'm confused about the relationship between two parameters: the requests value and the cpu.shares of the cgroup, which is updated once the Pod is deployed. According to the reading I've done so far, cpu.shares reflects some kind of priority when competing for a chance to consume the CPU, and it's a relative value.

So my question is: why does Kubernetes treat the CPU request value as an absolute value when scheduling? When it comes to the CPU, processes get a time slice to execute based on their priorities (according to the CFS mechanism). To my knowledge, there's no such thing as handing out fixed amounts of CPUs (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes the tasks, why does Kubernetes use the exact request value (e.g. 1500m, 200m) to find a node?

Please correct me if I've got this wrong. Thanks !!


Solution

  • Answering your questions from the main question and comments:

So my question is: why does Kubernetes treat the CPU request value as an absolute value when scheduling?

    To my knowledge, there's no such thing as handing out fixed amounts of CPUs (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes the tasks, why does Kubernetes use the exact request value (e.g. 1500m, 200m) to find a node?

    It's because decimal CPU values in the requests are always converted to values in millicores, e.g. 0.1 is equal to 100m, which can be read as "one hundred millicpu" or "one hundred millicores". Those units are specific to Kubernetes:

    Fractional requests are allowed. A Container with spec.containers[].resources.requests.cpu of 0.5 is guaranteed half as much CPU as one that asks for 1 CPU. The expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.

    CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

    Based on the above, remember that you can ask for, let's say, 1.5 CPUs of a node by specifying either cpu: 1.5 or cpu: 1500m - the two forms are equivalent.
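    As a quick way to see the absolute quantities the scheduler works with, you can compare a node's allocatable CPU with the CPU requests already placed on it. A minimal sketch (assuming a working kubectl context; <node-name> is a placeholder for one of your nodes):

    # Allocatable CPU of the node - the absolute capacity the scheduler fits requests into.
    kubectl describe node <node-name> | grep -A 7 "Allocatable:"
    # "Allocated resources" in the same output shows the sum of CPU requests of pods already on the node.
    kubectl describe node <node-name> | grep -A 10 "Allocated resources:"

    The scheduler simply checks that the sum of existing requests plus the new pod's request fits into the allocatable value; cpu.shares only matters afterwards, at runtime.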

    Just wanna know whether lowering the cpu.shares value in cgroups (which is modified by k8s after the deployment) affects the CPU power consumed by the process. For instance, assume that containers A and B have 1024 and 2048 shares allocated. So the available resources will be split in a 1:2 ratio. So would it be the same if we configure cpu.shares as 10 and 20 for the two containers? The ratio is still 1:2.

    Let's make it clear - it's true that the ratio is the same, but the values are different. 1024 and 2048 in cpu.shares correspond to cpu: 1000m and cpu: 2000m defined in Kubernetes resources, while 10 and 20 correspond to cpu: 10m and cpu: 20m (see the sketch below).
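    A minimal sketch of that conversion - the kubelet computes cpu.shares as roughly millicores × 1024 / 1000 (with a small minimum applied in practice):

    # Convert CPU requests in millicores to the cpu.shares value written into the cgroup.
    for m in 1000 2000 10 20 300 150; do
      echo "${m}m -> $(( m * 1024 / 1000 )) cpu.shares"
    done

    So containers with 10 and 20 shares compete in the same 1:2 ratio as containers with 1024 and 2048, but they represent far smaller requests, and the scheduler reserves correspondingly less capacity for them on the node.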

    Let's say the cluster nodes are based on a Linux OS. So, how does Kubernetes ensure that the request value is given to a container? Ultimately, the OS will use the configuration available in a cgroup to allocate resources, right? It modifies the cpu.shares value of the cgroup. So my question is, which files are modified by k8s to tell the operating system to give 100m or 200m to a container?

    Yes, your thinking is correct. Let me explain in more detail.

    Generally, on a Kubernetes node there are three cgroups under the root cgroup, known as slices:

    The k8s uses the cpu.shares file to allocate the CPU resources. In this case, the root cgroup inherits 4096 CPU shares, which is 100% of the available CPU power on a 4-core node (1 core = 1024; this is a fixed value). The root cgroup allocates its shares proportionally based on the children's cpu.shares, and they do the same with their children, and so on. On a typical Kubernetes node there are three cgroups under the root cgroup, namely system.slice, user.slice, and kubepods. The first two are used to allocate resources for critical system workloads and non-k8s user space programs. The last one, kubepods, is created by k8s to allocate resources to pods.
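    A sketch of how to inspect those top-level slices on a node (this assumes cgroup v1 mounted under /sys/fs/cgroup; depending on the cgroup driver the last directory may be named kubepods or kubepods.slice):

    # Print the CPU shares assigned to each top-level slice under the root cgroup.
    for slice in system.slice user.slice kubepods; do
      echo "${slice}: $(cat /sys/fs/cgroup/cpu/${slice}/cpu.shares)"
    done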

    To check which files are modified, we need to go to the /sys/fs/cgroup/cpu directory. Here we can find a directory called kubepods (one of the above mentioned slices), which holds all the cpu.shares files for pods. Inside the kubepods directory there are two other folders - besteffort and burstable. It is worth mentioning here that Kubernetes has three QoS classes: Guaranteed, Burstable and BestEffort.

    Each pod has an assigned QoS class, and depending on which class it is, the pod's cgroup is created in the corresponding directory (except for Guaranteed - pods of that class are created directly in the kubepods directory). You can check a pod's class as shown in the sketch below.
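    A quick sketch for checking the class and the resulting directory (<pod-name> is a placeholder; the path assumes cgroup v1):

    # Print the QoS class Kubernetes assigned to the pod.
    kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}{"\n"}'
    # List the per-pod cgroup directories created for the Burstable class.
    ls /sys/fs/cgroup/cpu/kubepods/burstable/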

    For example, I'm creating a pod with following definition:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deployment
    spec:
      selector:
        matchLabels:
          app: test-deployment
      replicas: 2 # tells deployment to run 2 pods matching the template
      template:
        metadata:
          labels:
            app: test-deployment
        spec:
          containers:
          - name: nginx
            image: nginx:1.14.2
            ports:
            - containerPort: 80
            resources:
              requests:
                cpu: 300m
          - name: busybox
            image: busybox
            args:
            - sleep
            - "999999"
            resources:
              requests:
                cpu: 150m
    

    Based on the earlier mentioned definitions, this pod will be assigned the QoS class Burstable (it has CPU requests but no limits), thus its cgroup will be created in the /sys/fs/cgroup/cpu/kubepods/burstable directory.
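    If you want to reproduce this, a sketch (assuming the manifest above is saved as test-deployment.yaml and <pod-name> is one of the resulting pods): apply it and read back the pod UID, which appears in the cgroup directory name:

    # Create the deployment and find one of its pods.
    kubectl apply -f test-deployment.yaml
    kubectl get pods -l app=test-deployment
    # The pod UID is the suffix of the "pod..." cgroup directory shown below.
    kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}{"\n"}'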

    Now we can check cpu.shares set for this pod:

    user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90 $ cat cpu.shares
    460
    

    It is correct, as one container requests 300m and the second one 150m, and the value is calculated by multiplying the total millicores by 1024/1000 (450m × 1024 / 1000 ≈ 460). For each container we have sub-directories as well:

    user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/fa6194cbda0ccd0b1dc77793bfbff608064aa576a5a83a2f1c5c741de8cf019a $ cat cpu.shares
    153
    user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/d5ba592186874637d703544ceb6f270939733f6292e1fea7435dd55b6f3f1829 $ cat cpu.shares
    307
    
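    307 + 153 = 460, which matches the pod-level value above. A sketch that dumps the whole hierarchy for the pod in one go (cgroup v1 layout, pod UID taken from the example above):

    # Print cpu.shares for the pod cgroup and every container cgroup underneath it.
    find /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90 \
      -name cpu.shares | while read -r f; do
        echo "${f}: $(cat "${f}")"
    done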

    If you want to read more about Kubernetes CPU management, I'd recommend reading the following: