Plan to scale
When planning for scale, every tool out there has individual capacity limits. Many can scale horizontally, and many scale vertically, but ALL of them cost time, money, or both to scale.
Unbounded scaling is just not available for free. Even if the products in use attract no license fees, there are always costs: CPU, hosting, installation, complexity, build scaling, load distribution. Everything has a budget: there are only so many hours in a dev schedule and so many dollars in capital and ops budgets.
User and budget friendliness
One way to look at budget-friendly scaling is to look at the consequence of not scaling: errors, and how they affect the user experience. It has been our experience that in modern user experience, a fast error is far preferable to a slow eventual timeout. This leads to a possible solution: what if we prefer to send a quick error over an eventual timeout? What would that look like?
Concurrency is how your back-end scales
A common way that many systems are scaled is via some form of concurrency. From old-school monolithic servers to up-to-the-minute microservice containers, all eventually hit a limit on how many concurrent requests they can serve before either the response time becomes unacceptably long, or timeouts occur on the request side.
Concurrency and latency are key in a larger TPS conversation
Managing scaling must pay careful attention to the queuing equation:
Transactions per second = (Concurrency) / (Latency in Seconds)
In practical terms, that means if the average latency of your requests is 5 seconds, you need 500 outstanding requests to service just 100 requests per second, assuming the number of concurrent connections doesn’t also increase your average latency – rarely the case.
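This relationship (a form of Little’s Law) is easy to sanity-check in a few lines. The helper names below are our own, and the 5-second / 100-TPS figures are just the worked example above, not measurements:

```python
# Sanity-checking the queuing equation: TPS = concurrency / latency.
# Figures mirror the worked example in the text; they are illustrative.

def required_concurrency(tps: float, latency_s: float) -> float:
    """Concurrency needed to sustain `tps` at average latency `latency_s`."""
    return tps * latency_s

def achievable_tps(concurrency: float, latency_s: float) -> float:
    """Throughput a fixed concurrency can deliver at a given average latency."""
    return concurrency / latency_s

print(required_concurrency(tps=100, latency_s=5.0))    # 500.0 outstanding requests
print(achievable_tps(concurrency=500, latency_s=1.0))  # cut latency to 1s: 500.0 TPS
```

The second call shows the point of the next paragraph: the same 500-request concurrency budget delivers five times the throughput once average latency drops from 5 seconds to 1.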
If you can drive your maximum and average latency downwards, you can have smaller concurrency and still get the transactions per second you need.
Smaller concurrency is generally better, for many reasons: network resources, scaling at the external-facing parts of infrastructure such as load balancers, and the back-end overhead of servicing so many concurrent requests.
High concurrency can cause a runaway at the user experience level
Let me give a more concrete example: assume a standard Java app server – nearly always configured with a maximum thread count. With a configured thread count of 100, if you present 101 concurrent requests, the 101st request MUST wait until one of the other requests completes and frees a thread, doubling the latency for that request. If you present 200 concurrent requests, then your user-apparent latency increases by at least 50% on average, as half the requests are waiting on free threads.
If you also have users cancelling requests, the inbound retry of the request contributes to the load and concurrency, making the latency worse.
We can make a simplifying assumption that your internal design is compute bound, meaning that the number of physical cores on the machine (let’s say 16) determines the concurrency at which the real service latency starts to ramp up. That means that with 200 inbound requests and the 100-thread pool above, you have 16 requests actually executing, 84 requests waiting in the app, and 100 more waiting in the network stack and connector.
The reality of the situation usually includes some level of external calls, meaning concurrent processing somewhat higher than the number of cores occurs. This moves the latency further back into your infrastructure; it does not make the problem disappear. Sometimes the limited resource is a database insert that ends up being only as fast as the disk will go, and again, customer-visible service latency starts to ramp up.
Without concurrency limits, user satisfaction starts to really vary based on the time of day – during peak times, the latency can approach and surpass the threshold of customer patience, and then a cancel/re-request happens, making the problem worse.
Browser timeouts are insanely long; let’s talk about human timeouts.
The Firefox web browser’s standard request timeout for HTTP calls is 300 seconds. That’s well beyond anyone’s patience.
At 300 seconds, in some contexts, the requested information – even should it finally reach the browser – may no longer be relevant, resulting in yet another request.
Timeouts happen far sooner than you think
If the user-apparent latency of an API call isn’t capped at some sane amount, the overall effect is to reduce the number of successful requests that complete – from the user’s perspective at least – because more people will give up and cancel around the 10-second mark, and your system will waste resources servicing requests that will never be delivered – still a failed request from the user’s perspective.
There’s real benefit beyond scaling: if a client-side request times out or is cancelled by the user, the semantics of how load balancers, the public internet and enterprise network infrastructure work mean that the chances of the back-end server receiving any kind of actionable “Cancel” are effectively zero.
This means that waiting the standard 60 to 120 seconds – especially with the impatience of end users in mind – becomes questionable.
Think of it this way: customers normally stop paying attention to the user interface between 1 and 10 seconds, according to Nielsen research going back to the 1960s. At 10 seconds, you’ve lost their attention and increased their frustration.
Because nobody pays attention long enough
The common response from users nowadays is to wait a few seconds, lose patience, and then hit the cancel button, or close the app, or worse, the refresh button – and your infrastructure will continue to process the previous request, which will never be received by the user, PLUS the new one – because the UI has discarded the request context.
Allowing anything more than 10 seconds of response time, in general, wastes many seconds of back-end processing, concurrency in middle tiers, and the user’s goodwill.
The sooner you respond to the user, the better
But of course, shorter is better. Planning for sub-10-second interactions in general suggests that if an operation is expected to take more than 1 or 2 seconds, you should explore asynchronous requests: it’s relatively easy to submit a request, internally put the request on a queue, and send some data back so the client can present some UI to check progress. This will virtually eliminate duplicate requests.
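A minimal sketch of that submit-then-poll pattern might look like the following. The in-memory job store, the ticket format, and the worker loop are illustrative assumptions, not any particular framework’s API:

```python
# Submit-then-poll sketch: accept the request immediately, queue the work,
# and hand back a ticket the client can poll for progress.
import queue
import threading
import uuid

jobs: dict[str, dict] = {}   # ticket -> {"status": ..., "result": ...}
work = queue.Queue()

def submit(payload) -> str:
    """Accept the request immediately and return a ticket to poll."""
    ticket = str(uuid.uuid4())
    jobs[ticket] = {"status": "queued", "result": None}
    work.put((ticket, payload))
    return ticket            # client renders a progress UI from this

def status(ticket: str) -> dict:
    return jobs[ticket]

def worker():
    while True:
        ticket, payload = work.get()
        jobs[ticket]["status"] = "running"
        jobs[ticket]["result"] = payload.upper()   # stand-in for the slow work
        jobs[ticket]["status"] = "done"
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

t = submit("hello")
work.join()                  # in real life the client polls status(t) instead
print(status(t))             # {'status': 'done', 'result': 'HELLO'}
```

Because the ticket is returned in milliseconds, the user never sits on a hung request, and a refresh re-polls the same ticket instead of spawning a duplicate job.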
Even for Batch
Even between machines, for effectively batch-style operations, long latency still becomes questionable: it often implies that the workflow is stacking up many concurrent operations, causing systems to lean heavily on their schedulers and waste time managing process state.
It is better, for many reasons, not to risk a failed client-side timeout.
Timeout early, timeout often
Our experience is that a timeout in the client software is a failure case we must avoid; instead, plan for a customer-visible error as a viable response, by managing timeout and concurrency actively.
Use infrastructure for UX
This means that API gateways and similar infrastructure tools can be used to improve your success rate with hard limits on latency and concurrency.
If you limit concurrency and maximum latency to a level your tool chain can handle, and reject requests above that with a “too busy” message – perhaps via HTTP 503, 504 or some other strong signal – then you can plan for a user experience with fewer wasted requests and, on average, more successful client-visible requests.
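As a sketch of that shed-load-early policy – assuming for illustration that the limit is enforced in application code, though in practice it belongs in the gateway or load balancer – a non-blocking semaphore can reject excess requests with a fast 503-style response instead of queuing them:

```python
# Concurrency gate: serve up to `limit` requests at once; shed the rest
# immediately with a fast, honest "too busy" error rather than queuing.
import threading

class ConcurrencyGate:
    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, handler, *args):
        if not self._slots.acquire(blocking=False):
            return 503, "too busy, retry later"   # fast error beats slow timeout
        try:
            return 200, handler(*args)
        finally:
            self._slots.release()

gate = ConcurrencyGate(limit=1)
print(gate.call(lambda: "ok"))    # (200, 'ok') - a slot was free

full = ConcurrencyGate(limit=0)   # simulate every slot already in flight
print(full.call(lambda: "ok"))    # (503, 'too busy, retry later')
```

The rejected request costs almost nothing on the back end, which is exactly the trade the article argues for: a quick error over an eventual timeout.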
This probably also means you can get a higher “customer perceived fully successful request” rate, even in high-volume, under-powered periods, and respond better under heavy load – because the infrastructure would prevent crash-prone overload conditions on the back end.
Planned acceptable error rate
Assume you want to provide a level of service – say some number of concurrent users, or a transactions-per-second rate. Using the above thinking, we can start creating a service level that is defined in terms of maximum latency and acceptable error responses.
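Under those caps, the planned error rate falls out of the same queuing equation: capacity is the concurrency limit divided by the latency cap, and any offered load beyond capacity is shed as a fast error. A back-of-envelope sketch, with purely illustrative figures:

```python
# Planned error rate from the caps: capacity = max_concurrency / max_latency,
# and offered load beyond capacity is shed with a "too busy" response.

def capacity_tps(max_concurrency: int, max_latency_s: float) -> float:
    """Sustainable throughput under the concurrency and latency caps."""
    return max_concurrency / max_latency_s

def planned_error_rate(offered_tps: float, max_concurrency: int,
                       max_latency_s: float) -> float:
    """Fraction of offered load shed once capacity is exceeded."""
    cap = capacity_tps(max_concurrency, max_latency_s)
    return max(0.0, 1.0 - cap / offered_tps)

print(capacity_tps(500, 5.0))             # 100.0 TPS sustainable under the caps
print(planned_error_rate(125, 500, 5.0))  # ~0.2 -> plan for a 20% shed rate at peak
```

Stating the shed rate up front turns overload from an outage into a budgeted, user-visible error rate you can put in the service level.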
During unexpected scaling events like a DDoS, by putting maximum limits on concurrency, you can reduce the amount of traffic that needs to be considered by back-end servers.
It is far easier to plan for an acceptable user experience if requests don’t time out at the browser, but instead terminate in infrastructure. Lengthy requests don’t tie up external interfaces if you use the asynchronous design described above. In mobile, thanks to the changing nature of network requests, this can lead to a better user experience, independent of the underlying network conditions.
API gateways can be used to improve your user experience by setting maximum concurrency and latency. Setting concrete, user-friendly error conditions as part of the API contract is the way to get there.