Today, I have a guest contributor: Vidhya Bhushan, a member of our performance testing team at Broadcom. We were discussing performance testing for a specific use case, and when I asked about scaling out the test generation, he commented that we might not need to, since the test generator container was only showing 80% CPU usage.
My experience has been that 80% CPU usage means a maxed-out container. So we tried a second test gen container, and indeed we got nearly 50% more TPS than with a single container, while both test containers now showed 70% CPU usage. In this case especially, we were benchmarking the load generator, not the system under test. As Vidhya said at the time, the bottleneck proved to be the load generator.
If this environment were bare metal, I would have been satisfied with the original test, but not in this case, simply because every level of abstraction gives us a new level of approximation and a new layer of hidden delays that don’t show up as busy CPUs.
Lo these far too many years ago, I was finishing a degree in Electronics. At the time, the electronics curriculum was transitioning from filters and audio circuits to computers and digital electronics, so we spent time hand-building modems and filters in our labs. We built crude CPU hardware components like shift registers and adder circuits. We etched and soldered circuit boards, and built things on breadboards with resistors, capacitors and chips. Vidhya has a degree in Electronics as well, so there was a lot of common ground.
In the intervening years, I’ve seen, written, implemented and deployed more and more levels of abstraction. But I never forget the smell of a transistor burnt by being installed backwards, and the magic smoke coming out. I like to think that keeps me grounded in reality.
Compared to the instant feedback of burnt transistors (and fingers) and the precision of an ’80s-era oscilloscope with nanosecond resolution, we have a lot less certainty with all the levels of abstraction now in use.
The test in question has so many abstractions: it was a user-mode performance test, written in a meta-language (JMeter), running on a garbage-collected high-level language (Java), running in containers orchestrated by Kubernetes. Those containers ran on Docker, which in our case ran on VMware, somewhere in a cloud data center we had no real control over, with abstract-at-best control of the networking and very little knowledge of the minute-to-minute latencies in that network.
The way I heard it was that adding another level of abstraction solves every problem in computing – except for too many levels of abstraction.
As Vidhya pointed out at the time, when it comes to determining the cost of the abstractions, the answer is as abstract as the question itself. We need to make a judgement call on how much resource underutilization is acceptable. In our quest to find the abstraction cost, we can, for instance, plot the number of load generators against throughput per load generator and find where we hit diminishing returns. The abstraction cost arrived at that way can then be factored into other test models.
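That diminishing-returns analysis can be sketched in a few lines. The measurements and the 10% marginal-gain cutoff below are illustrative assumptions, not numbers from our actual tests:

```python
# Hypothetical measurements: total TPS observed as we add load generators.
measurements = {1: 1000, 2: 1450, 3: 1600, 4: 1650}  # loadgens -> total TPS

def diminishing_returns_point(tps_by_loadgens, min_marginal_gain=0.10):
    """Return the first loadgen count past which adding another generator
    improved total throughput by less than min_marginal_gain (fractional)."""
    counts = sorted(tps_by_loadgens)
    for prev, cur in zip(counts, counts[1:]):
        gain = (tps_by_loadgens[cur] - tps_by_loadgens[prev]) / tps_by_loadgens[prev]
        if gain < min_marginal_gain:
            # Scaling past this point mostly measures the loadgens themselves.
            return prev
    return counts[-1]

print(diminishing_returns_point(measurements))  # -> 3
```

With these sample numbers, the third generator still buys a meaningful gain but the fourth doesn’t, so three generators is where the abstraction cost starts dominating.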
While I really like the tooling I get to use now (I have far fewer burnt fingers, and I love the scale I can command with very little effort), I miss the certainty.
I think there are some things we can take away from this rambling. The first is that as soon as you introduce an abstraction, you need to account for the loss of precision.
In our example above, the loss of precision occurs at every level: GC overhead; the networking latency and scheduling unpredictability of Docker at the local level; the syscall overhead of VMware’s hardware emulation; and the unpredictability of networking latency in a remote network.
My rule of thumb is that at anything above 50% reported utilization on any resource (network, CPU, disk), you need to plan more scale tests with the aim of establishing a confidence interval: test as close as you can get to full utilization on some resource allocation, then plan to double up or more on the resources allocated without increasing your test parameters, so as to establish the behaviour when machine resources are more plentiful. Load generators showing as busy are especially dangerous; you probably have the situation we did, where we were really just benchmarking the load generators.
This can of course get silly at outrageous resource allocations, so sometimes simply reducing your test parameters helps establish your confidence level. But testing does have to achieve throughput in the same order of magnitude as production throughput. As an engineer, I’d want a scaling plan that accommodates at least 2x current production peaks, but I’m a pretty conservative guy.
There is also an uncomfortable, but actually pretty useful, other outcome you might encounter: you may discover that no amount of scaling of some individual resource improves your throughput, because you have hit a new bottleneck elsewhere. That’s why testing only at high resource utilization can leave you uninformed. As the saying goes, “Knowing is half the battle.” Bottlenecks are a given in most systems, and nearly all of them relate to some form of coupling.
I once encountered a situation where the method of estimating peak performance consisted of reducing resource allocation rather than increasing it. This rested on a naive assumption that is the flip side of the bottleneck points above: the investigators shut down the scaled-out instances of one resource but left a very central, well-known bottleneck in place.
With fewer load-generating parts of the system but the same common bottleneck, they took the behaviour observed in that state and attempted to extrapolate: once the rest of the scaled resources were restored, overall performance would scale linearly, beyond the limit of the bottleneck. Reality eventually caught up with them in production.
This group had assumed that the only bottlenecks were the systems they scaled down. Bottlenecks and scaling are roughly the inverse of each other. Predicting where you need to scale or optimize is a completely separate topic, but let me leave you with the rules of optimization, which are nearly the same as the rules of tuning: don’t do it; don’t do it yet; and if you still feel you need to, test first.
I’ll note that observing CPU usage in particular can be tricky. You may actually be short of RAM, but in C# and Java that shows up as CPU usage from garbage collection. I find this especially complex as the world of containers matures: many organizations work with rules of thumb around CPU, memory and disk space that don’t square with the realities of enterprise-class scaling, and garbage-collected runtimes aren’t going away any time soon in the enterprise.
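One way to unmask that effect is to compare time spent in garbage collection against total CPU time (JVM GC logs or JMX metrics are typical sources; how you collect the numbers is up to you). This sketch, the sample numbers, and the 20% cutoff are illustrative assumptions, not a JVM standard:

```python
def gc_pressure(cpu_seconds, gc_seconds, threshold=0.20):
    """If more than `threshold` of observed CPU time went to garbage
    collection, the 'busy CPU' is probably a symptom of memory pressure,
    not of real load. Returns (gc_fraction, is_memory_pressure)."""
    fraction = gc_seconds / cpu_seconds
    return fraction, fraction > threshold

# Hypothetical sample: 300s of CPU time, 90s of it spent in GC.
frac, short_on_ram = gc_pressure(300.0, 90.0)
print(round(frac, 2), short_on_ram)  # -> 0.3 True
```

If this fires, adding CPU to the container won’t help much; adding memory (or reducing allocation churn) will.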
Working through both under-resourced and over-resourced scenarios can help you establish a performance “curve”. When over-resourced, latency is usually lowest and user experience is best. With lowered resources, you get an idea of what happens in the worst case. That relates directly to another post I made a few months ago about planned acceptable error rates.
Obviously you can’t go to production over-resourced on most corporate budgets, but knowing where things get “bad” is the most useful thing for planning, because the 80/20 rule often applies: you get 80% of the desired outcome with 20% of the resources allocated. But you need to know where that point is. For example, you might learn that once memory usage goes above 85% on some central component, no amount of additional load results in significantly higher transaction rates, and mostly results in worse latency. So you’d plan to add memory before you hit 85% utilization.
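Finding that knee in the curve can be mechanical once you have the data points. The utilization/TPS pairs and the 2% gain cutoff below are illustrative assumptions, not measurements:

```python
# Hypothetical data: memory utilization on a central component vs the
# transaction rate observed at that level of load.
curve = [(0.60, 900), (0.70, 1150), (0.80, 1300), (0.85, 1340), (0.90, 1345)]

def capacity_knee(points, min_gain=0.02):
    """Return the utilization level beyond which pushing more load stopped
    buying meaningful throughput (and mostly bought worse latency)."""
    for (u1, t1), (u2, t2) in zip(points, points[1:]):
        if (t2 - t1) / t1 < min_gain:
            return u1  # the knee: plan to add capacity before this point
    return points[-1][0]

print(capacity_knee(curve))  # -> 0.85
```

With these sample points, the knee lands at 85% memory utilization, matching the planning example above.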
TL;DR: The more abstractions you have, the more levels of scaling you should evaluate before you can establish which resource(s) you need to scale.
My thanks to Vidhya for a great perspective on this very real problem.