I'm confused about the relationship between two parameters: the CPU requests of a Pod and the cpu.shares value of the cgroup, which is updated once the Pod is deployed. According to the reading I've done so far, cpu.shares reflects some kind of priority when competing for a chance to consume the CPU, and it is a relative value.
So my question is: why does Kubernetes treat the CPU request as an absolute value when scheduling? When it comes to the CPU, processes get time slices to execute based on their priorities (according to the CFS mechanism). To my knowledge, there is no such thing as handing out fixed amounts of CPU (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes the tasks, why does Kubernetes use the exact request value (e.g. 1500m, 200m) to find a node?
Please correct me if I've got this wrong. Thanks !!
Answering your questions from the main post and the comments:
So my question is: why does Kubernetes treat the CPU request as an absolute value when scheduling?
To my knowledge, there is no such thing as handing out fixed amounts of CPU (1 CPU, 2 CPUs, etc.). So, if the cpu.shares value is what prioritizes the tasks, why does Kubernetes use the exact request value (e.g. 1500m, 200m) to find a node?
It's because decimal CPU values from the requests are always converted to values in millicores; for example, 0.1 is equal to 100m, which can be read as "one hundred millicpu" or "one hundred millicores". These units are specific to Kubernetes:
Fractional requests are allowed. A Container with spec.containers[].resources.requests.cpu of 0.5 is guaranteed half as much CPU as one that asks for 1 CPU. The expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
Based on the above, remember that you can request, say, 1.5 CPUs of a node by specifying either cpu: 1.5 or cpu: 1500m, as in the snippet below.
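For instance, here is a minimal Pod manifest using the decimal form (the pod and container names are placeholders, not part of the original example); the commented-out line shows the equivalent millicore form:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-request-demo    # placeholder name
spec:
  containers:
  - name: app               # placeholder container name
    image: nginx:1.14.2
    resources:
      requests:
        cpu: "1.5"          # decimal form; the API stores this as 1500m
        # cpu: 1500m        # equivalent millicore form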
I just want to know whether lowering the cpu.shares value in cgroups (which is modified by k8s after the deployment) affects the CPU power consumed by the process. For instance, assume containers A and B have 1024 and 2048 shares allocated, so the available resources will be split in a 1:2 ratio. Would it be the same if we configured cpu.shares as 10 and 20 for the two containers? The ratio is still 1:2.
Let's make it clear: it's true that the ratio is the same, but the values are different. 1024 and 2048 in cpu.shares correspond to cpu: 1000m and cpu: 2000m defined in the Kubernetes resources, while 10 and 20 correspond to cpu: 10m and cpu: 20m.
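A rough sketch of that conversion, assuming the commonly documented millicores-to-shares formula shares = millicores × 1024 / 1000 (the kubelet's exact rounding may differ slightly):

# millicores to cpu.shares, integer arithmetic
echo $(( 1000 * 1024 / 1000 ))   # 1000m -> 1024 shares
echo $(( 2000 * 1024 / 1000 ))   # 2000m -> 2048 shares
echo $((   10 * 1024 / 1000 ))   # 10m   -> 10 shares
echo $((   20 * 1024 / 1000 ))   # 20m   -> 20 shares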
Let's say the cluster nodes are based on a Linux OS. How does Kubernetes ensure that the request value is given to a container? Ultimately, the OS will use the configuration available in a cgroup to allocate resources, right? It modifies the cpu.shares value of the cgroup. So my question is, which file is modified by k8s to tell the operating system to give 100m or 200m to a container?
Yes, your thinking is correct. Let me explain in more detail.
Generally, on a Kubernetes node there are three cgroups under the root cgroup, known as slices:
The k8s uses the cpu.shares file to allocate the CPU resources. In this case, the root cgroup inherits 4096 CPU shares, which are 100% of the available CPU power (1 core = 1024; this is a fixed value). The root cgroup allocates its share proportionally based on its children's cpu.shares, and they do the same with their children, and so on. On typical Kubernetes nodes, there are three cgroups under the root cgroup, namely system.slice, user.slice, and kubepods. The first two are used to allocate resources for critical system workloads and non-k8s user-space programs. The last one, kubepods, is created by k8s to allocate resources to pods.
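You can inspect this hierarchy directly on a node; a sketch assuming cgroup v1 and the default cgroupfs driver layout (with cgroup v2 or the systemd driver the paths and names differ):

# run on the node itself
ls /sys/fs/cgroup/cpu/
# ... kubepods  system.slice  user.slice ...
cat /sys/fs/cgroup/cpu/kubepods/cpu.shares   # shares assigned to all pods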
To check which files are modified, we need to go to the /sys/fs/cgroup/cpu directory. Here we can find a directory called kubepods (one of the slices mentioned above), where all the cpu.shares files for the pods are located. In the kubepods directory we can find two other folders - besteffort and burstable. It is worth mentioning here that Kubernetes has three QoS classes: Guaranteed, Burstable, and BestEffort.
Each pod has an assigned QoS class and, depending on that class, the pod is located in the corresponding directory (except Guaranteed; a pod with this class is created directly in the kubepods directory).
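If you want to quickly check which class a pod got, kubectl can print it directly (the pod name is a placeholder):

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
# -> Guaranteed, Burstable or BestEffort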
For example, I'm creating a pod with the following definition:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  selector:
    matchLabels:
      app: test-deployment
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 300m
      - name: busybox
        image: busybox
        args:
        - sleep
        - "999999"
        resources:
          requests:
            cpu: 150m
Based on the definitions mentioned earlier, this pod will be assigned the QoS class Burstable (CPU requests are set, but limits are not), so it will be created in the /sys/fs/cgroup/cpu/kubepods/burstable directory.
Now we can check the cpu.shares set for this pod:
user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90 $ cat cpu.shares
460
It is correct, as one container requests 300m and the second one 150m, and the value is calculated by multiplying the total CPU request in cores by 1024: (0.3 + 0.15) × 1024 ≈ 460. For each container we have sub-directories as well:
user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/fa6194cbda0ccd0b1dc77793bfbff608064aa576a5a83a2f1c5c741de8cf019a $ cat cpu.shares
153
user@cluster /sys/fs/cgroup/cpu/kubepods/burstable/podf13d6898-69f9-44eb-8ea6-5284e1778f90/d5ba592186874637d703544ceb6f270939733f6292e1fea7435dd55b6f3f1829 $ cat cpu.shares
307
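If you want to locate such a directory yourself, the pod-level directory name is derived from the pod's UID, so something like the following should work (a sketch assuming the same cgroup v1 layout and the Burstable class as above; the pod name is a placeholder):

POD_UID=$(kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}')
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod${POD_UID}/cpu.shares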
If you want to read more about Kubernetes CPU management, I'd recommend reading the following: