
Never Go Limitless with Capacity Planning

Posted by Marbenz Antonio on October 17, 2022


We all understand the value of creating boundaries, whether they are for our children, our diet, our physical activity, or anything else. When it comes to resources for our applications, however, we may think, “Why limit ourselves? If a resource is available, give it to me.” It turns out that this approach, even in the world of seemingly limitless resources that the cloud offers, may not be the best one.

Red Hat launched a “resource hungry” application on OpenShift clusters of various sizes to verify the advised reservation of resources for system tasks. They were particularly interested in how clusters in which system and application resources are shared behave, namely Single Node OpenShift (SNO) clusters and Compact Clusters (three nodes serving both as control plane and worker nodes).

The findings demonstrate that there are situations in which allowing an application to run without limits may leave the control plane so unresponsive that the entire cluster, the limitless application included, becomes unusable, regardless of how many resources are reserved for system tasks.

The conclusion is that, as a cluster administrator, you should ask the developers of the applications deployed on the cluster(s) you manage to set resource limits in their application manifests. It is significantly safer, however, to enforce the use of resource constraints with the Kubernetes mechanism designed for this purpose: resource quotas per namespace, defined through ResourceQuota objects.

Motivation

Capacity planning is essential for getting the most out of your resources. It is not a simple task, though, and the result is often an allocation that is either too small or too large. As Red Hat collaborates with customers and partners on the design of OpenShift clusters to host their applications, the goal is to allocate enough resources to system tasks to keep the system functioning smoothly while leaving the vast majority of available resources to application workloads.

On one particular project, a significant partner in the telco sector asked Red Hat to verify its recommendations for allocating reserved cores to system workloads (four virtual cores in an SNO). In this experiment, they wanted to determine how well the system used its cores when running resource-demanding programs.

Fractals on demand

To simulate a situation similar to a regular web application with changing demands on server resources, Red Hat launched a straightforward web application that creates fractals of various sizes. This application was chosen because CPU, memory, and network usage can be adjusted very simply through parameters in the URL query string (Figure 1).

Figure 1 shows, from left to right, the outcomes of requesting 200 x 200, 400 x 400, and 600 x 600-pixel fractals. The demand for resources grows as the size of the requested image increases: large fractals demand a lot of computing power because the computation uses double-precision floating point matrices. The fractal-generating code was wrapped as an HTTP server, with all configuration options accessible as parameters in the URL query string. You can access the resulting code here.

Figure 1: Fractals on demand
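For illustration, a minimal web service along these lines might look like the sketch below. This is not the code Red Hat published (the actual repository is linked above); the width and height query parameters, the port, the PGM output format, and the use of Python with NumPy are assumptions made purely for this example.

# Illustrative sketch of an "on-demand" fractal web service (not Red Hat's code).
# The query parameter names (width, height) and the port are assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

import numpy as np


def mandelbrot(width: int, height: int, max_iter: int = 100) -> np.ndarray:
    """Compute a Mandelbrot escape-time image using double-precision matrices."""
    xs = np.linspace(-2.0, 1.0, width)
    ys = np.linspace(-1.5, 1.5, height)
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]   # grid over the complex plane
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=np.float64)
    for _ in range(max_iter):
        mask = np.abs(z) <= 2.0                      # points that have not escaped yet
        z[mask] = z[mask] ** 2 + c[mask]
        counts += mask
    return (255 * counts / max_iter).astype(np.uint8)


class FractalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Image size, and therefore CPU and memory cost, is driven by the query string.
        params = parse_qs(urlparse(self.path).query)
        width = int(params.get("width", ["512"])[0])
        height = int(params.get("height", ["512"])[0])
        pixels = mandelbrot(width, height)

        # Serve the result as a binary PGM image (no extra encoders required).
        body = b"P5\n%d %d\n255\n" % (width, height) + pixels.tobytes()
        self.send_response(200)
        self.send_header("Content-Type", "image/x-portable-graymap")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), FractalHandler).serve_forever()

Larger width and height values allocate larger double-precision matrices and make each iteration over them more expensive, which is why resource demand grows with the requested image size.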

Not quite on demand…

To stress the web application and analyze CPU and memory use in the cluster, Red Hat created a very straightforward web client that, given a URL and a number n, spawns n threads that simultaneously access the URL. Using this “parallel getter,” 1500 clients then accessed the web application simultaneously and continually in an endless loop, 1000 of them requesting small images (512 x 512 pixels) and the remaining 500 requesting large images (4096 x 4096 pixels).
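A “parallel getter” of this kind can be sketched in just a few lines. Again, this is an illustrative approximation rather than the actual tool: the target URL and the thread count are taken from the command line here, and the mix of small and large image requests used in the test is left out for brevity.

# Illustrative sketch of the "parallel getter": n threads repeatedly fetching a URL.
# The command-line interface and the example URL below are assumptions.
import sys
import threading
import urllib.request


def hammer(thread_id: int, url: str) -> None:
    """Request the URL in an endless loop, reporting successes and failures."""
    while True:
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                resp.read()
            print(f"Thread {thread_id} finished: HTTP {resp.status}")
        except Exception as exc:
            print(f"Thread {thread_id} failed: {exc}")


if __name__ == "__main__":
    url = sys.argv[1]           # e.g. http://<route>/?width=512&height=512
    n_threads = int(sys.argv[2])
    for i in range(n_threads):
        threading.Thread(target=hammer, args=(i, url), daemon=True).start()
    threading.Event().wait()    # keep the main thread alive while the workers run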

Unexpectedly, the cluster becomes unresponsive and the “on-demand” fractal generator simply stops working until the endless loop of concurrent web clients is terminated. At that point, the CPU utilization of system workloads climbs to levels that are significantly higher than what is ideal. The system starts to slow down a little after 18:36, as seen in Figure 2a (utilization for all workloads), but the maximum CPU usage by system workloads occurs after 18:40 (Figure 2b, utilization for system workloads only).

Figure 2: CPU utilization generated by multiple clients requesting fractal generation

Figure 2a: CPU utilization for all workloads
Figure 2b: CPU utilization for system workloads only

From CPU utilization alone it is unclear what is causing this increase, but looking at memory utilization provides some answers: at the “breaking point,” the application workloads have consumed almost all of the memory available to them (Figure 3).

Figure 3: Memory utilization for the same period

Quotas for resources to the rescue

Red Hat does not want the clusters to stop functioning, regardless of the applications that run on them. SNOs and compact clusters, where the control plane shares resources with the applications, require additional attention. In this experiment, the issue is caused by a memory-hungry application that declares no limits and uses all available resources. A well-behaved Kubernetes application should declare the minimum (resources.requests) and maximum (resources.limits) resource requirements per pod (in the spec.containers section). By default, however, these settings are optional and it is up to application developers to specify them, which leaves the cluster vulnerable to misuse.
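For reference, a well-behaved pod declaration might look roughly like the following; the image name and the request and limit values are illustrative, not taken from the experiment.

apiVersion: v1
kind: Pod
metadata:
  name: fractal-generator
spec:
  containers:
  - name: fractal
    image: quay.io/example/fractal-generator:latest   # illustrative image name
    resources:
      requests:        # minimum guaranteed to the container, used for scheduling
        cpu: 500m
        memory: 1Gi
      limits:          # hard ceiling; exceeding the memory limit gets the container killed
        cpu: "1"
        memory: 2Gi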

Fortunately, the ResourceQuota object specified in the official OpenShift documentation provides a mechanism that system administrators can utilize to force developers to set limits:

“A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per project. It can limit the number of objects that can be created in a project by type, as well as the total amount of computing resources and storage that might be consumed by resources in that project.”

Once a ResourceQuota has been set for a project, applications cannot be deployed unless their limits have been declared and fall within the quota. Note that the ResourceQuota object limits the amount of resources that may be consumed by a project, so to effectively ensure cluster stability, the resources allotted to projects must not exceed the resources that are actually available. In this situation, available memory = total memory – system memory used.

To be on the safe side (in case the system unexpectedly needs more memory), it is also advisable to keep application memory use to no more than 90% of the available memory.

Considering that the test SNO has 200 GB of total RAM and the system uses only about 15 GB of it (see Figure 4), allowing all applications to use up to 185 GB collectively should keep this cluster operating normally.

Figure 4: Memory used by system workloads

In light of this, Red Hat established the following resource quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
spec:
  hard:
    requests.memory: 185Gi
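Assuming the quota above is saved as memory-quota.yaml, it could be applied to the application's project with the standard OpenShift client; the project name fractal-app used here is a placeholder:

oc apply -f memory-quota.yaml -n fractal-app
oc describe quota memory-quota -n fractal-app   # shows used versus hard limits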

So, given that there is only one application project on that cluster and that it will not grow beyond nine pods, as in the tests, it is enough to add the following memory restriction to the application manifest:

resources:
  limits:
    memory: 20Gi
  requests:
    cpu: 1000m

With these restrictions in place, the endless loop of 1500 concurrent web clients was started once more (as above). Now, as can be seen in Figure 5, the system is still responsive after more than 40 minutes of testing, and CPU usage by system tasks remains below the advised level.

Figure 5: CPU and memory utilization under stress when the application runs with memory limits

Figure 5a: CPU utilization under stress
Figure 5b: CPU utilization under stress by system workloads
Figure 5c: Memory utilization under stress

ResourceQuota therefore makes it possible to enforce limits in application manifests, which ultimately results in a stable cluster. What about the now-restricted application, though? As the “Thread ### finished…” lines in the screenshot in Figure 6 show, the application is still usable, but some requests do fail.

This is to be expected and indicates that the application has used all of the resources allocated to it. Addressing this without the application failing would require adding resources to the cluster; simply removing the limits would again put the cluster itself at risk.

Figure 6: Screenshot of the multi-threaded web client requesting images from our fractal generator under stress. As expected, some threads succeed, while others fail.

Summary

Capacity planning is important for creating a stable cluster as well as for the usability and dependability of the applications that run on top of it. By default, OpenShift clusters give application developers a lot of power and place no limitations on them. Without proper care, an application might bring down the entire cluster, especially in constrained environments like SNOs or Compact Clusters. And with great power comes great responsibility.

The key lesson from this experiment is that it is always a good idea for cluster administrators to enforce restrictions by setting resource quotas on each OpenShift project. If you want to keep clusters stable in constrained environments like an SNO, it is not just a nice idea: it is a must.

 


Here at CourseMonster, we know how hard it may be to find the right time and funds for training. We provide effective training programs that enable you to select the training option that best meets the demands of your company.

For more information, please get in touch with one of our course advisers today or contact us at training@coursemonster.com
