Architect Your AWS SQS Application For Performance and Cost

This is the follow-up to my previous article, "Ouch! SQS Left A Hole In Your Pocket. So What's Next?".

A few years ago, I was assigned to investigate an increase in our AWS SQS service cost. Native cloud cost tools and third-party cloud cost optimization tools can provide some insight when you are debugging cloud costs, but their capabilities are limited. Typically, these tools are good at answering the "What" of the costs. To address the cost, you also need answers to the "Why" and "How" questions, which may entail understanding the application architecture and having basic knowledge of how AWS services work and interact with other services in your ecosystem.

So Why Did It Happen?

To answer the "Why" question, we first need to know about the SQS polling types and one of the factors that contributes to the service cost.

Amazon SQS offers two types of polling mechanisms: short polling and long polling, for receiving messages from a queue. By default, queues utilize short polling. With short polling, the ReceiveMessage API request queries a subset of the servers to locate available messages for inclusion in the response. Amazon SQS promptly delivers the response, regardless of whether the query discovered any messages. This means you may get an empty response from SQS with short polling.

The pricing of SQS depends on various factors, and one of those factors is the number of API requests. This means that empty responses from SQS count toward the total API call count, which contributes to the SQS cost. Such calls are tracked by the SQS NumberOfEmptyReceives metric in CloudWatch.
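
To get a feel for how quickly polling requests add up, here is a back-of-the-envelope sketch. The polling rate and the price per million requests are illustrative assumptions, not figures from our workload, and the sketch ignores any free tier. Check the current SQS pricing page for your region and queue type.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_receive_requests(pollers, polls_per_second, days=30):
    """ReceiveMessage calls made by `pollers` threads polling at a fixed rate."""
    return round(pollers * polls_per_second * SECONDS_PER_DAY * days)


def request_cost(requests, price_per_million):
    """Cost of `requests` API calls at an assumed price per million requests."""
    return requests / 1_000_000 * price_per_million


# Assumption: a single thread short polling 10 times per second for 30 days.
requests = monthly_receive_requests(pollers=1, polls_per_second=10)
print(requests)  # 25920000

# Assumption: a hypothetical price of 0.40 USD per million requests.
print(round(request_cost(requests, price_per_million=0.40), 2))  # 10.37
```

Every one of those calls is billed whether or not it returns a message, so the figure scales linearly with the number of polling threads.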

The majority of the cost in our case was attributed to the large number of empty receives. This pattern was caused by an unoptimized number of application threads in the service: too many threads per service were configured relative to the number of messages arriving at the queue. As a result, a large proportion of threads were receiving empty responses.
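
One way to check whether a queue shows the same pattern is to compare NumberOfEmptyReceives with NumberOfMessagesReceived in CloudWatch. The sketch below assumes a boto3 CloudWatch client is passed in; the helper names and the 24-hour window are my own choices for illustration. Because NumberOfMessagesReceived counts messages rather than API calls, treat the ratio as a rough signal only.

```python
from datetime import datetime, timedelta, timezone


def metric_sum(cloudwatch, queue_name, metric_name, hours=24):
    """Sum an AWS/SQS metric for one queue over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600,  # one datapoint per hour
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])


def empty_to_received_ratio(cloudwatch, queue_name, hours=24):
    """Empty receives per message received; None if nothing was received."""
    empty = metric_sum(cloudwatch, queue_name, "NumberOfEmptyReceives", hours)
    received = metric_sum(cloudwatch, queue_name, "NumberOfMessagesReceived", hours)
    return empty / received if received else None
```

With boto3 installed, this could be called as empty_to_received_ratio(boto3.client("cloudwatch"), "my-queue"), where "my-queue" is a placeholder queue name.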

How To Fix It?

There are various strategies to address a high number of empty receives, and the best one depends on your workload. Here are some of the strategies you can employ if you identify a high number of empty receives.

Long Polling: Switch from short polling to long polling. When this option is enabled, Amazon SQS waits until at least one message is available and then responds with up to the maximum number of messages given in the request. It sends an empty response only if the polling wait time (up to 20 seconds) expires. This means that the application threads will be held for longer until they receive a response. Note: This will require configuring the right thread timeout value for the application's performance and stability.
Application Threads: Depending on the workload, the number of application threads will need to be adjusted to reduce the number of empty responses while maintaining optimal application performance and service costs. For example, in the diagram below, instead of having three application threads in the thread pool for service 1, you reduce the number of threads to two.

[Diagram: thread pool for service 1 reduced from three application threads to two]

Number of Application/Client Connections: Another strategy is to examine the number of clients that are connected to SQS. For example, if you have three application servers running the same service in a cluster, each of which has three threads communicating with SQS, reducing the number of clients by one reduces the number of threads by three. However, if you want to maintain the same number of clients, you must consider optimising the number of application threads. In the diagram below, for example, instead of three instances of the same service running, there are only two.

[Diagram: two instances of the same service connected to SQS instead of three]
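
The long polling strategy above can be sketched as follows. This is a minimal illustration rather than our production consumer: it assumes a boto3 SQS client is passed in as sqs, and poll_once, enable_long_polling and handler are hypothetical names. The receive_message, delete_message and set_queue_attributes calls are the standard boto3 SQS client operations.

```python
def poll_once(sqs, queue_url, handler, wait_seconds=20, max_messages=10):
    """Long-poll the queue once, process and delete what arrives, return the count."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows 1 to 10 per request
        WaitTimeSeconds=wait_seconds,      # 0 = short polling, 20 = maximum wait
    )
    messages = response.get("Messages", [])  # key is absent on an empty receive
    for message in messages:
        handler(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)


def enable_long_polling(sqs, queue_url, wait_seconds=20):
    """Set the queue-level default wait, used when a request passes no WaitTimeSeconds."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"ReceiveMessageWaitTimeSeconds": str(wait_seconds)},
    )
```

Each worker thread would call poll_once in a loop. Because a call can now block for up to wait_seconds, the HTTP client's read timeout and any thread timeout in the application need to be longer than the wait time.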

We had to combine the first two strategies for optimal performance and cost efficiency. Multiple rounds of performance testing were also carried out in order to achieve the best configuration for our workload.
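
To see why combining the two levers helps, here is a rough estimate of how many empty receives an idle queue attracts. The 0.1 second short polling round trip is a hypothetical figure, and the model assumes every thread polls back to back; it is not a measurement from our system.

```python
def idle_empty_receives_per_hour(instances, threads_per_instance, seconds_per_poll):
    """Rough estimate of empty receives per hour while the queue stays empty."""
    pollers = instances * threads_per_instance
    return round(pollers * 3600 / seconds_per_poll)


# Assumption: a short poll returns in about 0.1 s, three instances with three threads each.
print(idle_empty_receives_per_hour(3, 3, 0.1))  # 324000

# The same nine pollers with a 20 s long polling wait.
print(idle_empty_receives_per_hour(3, 3, 20))   # 1620

# Long polling plus two threads per instance instead of three.
print(idle_empty_receives_per_hour(3, 2, 20))   # 1080
```

The wait time dominates the estimate, while the thread count scales it linearly, which is why the right mix still has to be confirmed with performance testing on the real workload.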

The following example shows two different real-world workloads, one with short polling and the other with long polling configuration. After optimising the short polling workload, we were able to cut our SQS costs in half.

The first graph employs short polling, while the second employs long polling.

[Graphs: short polling workload (first) and long polling workload (second)]

A high NumberOfEmptyReceives count indicates that your application's polling configuration is not well matched to its SQS workload. It is an opportunity to review your application design for performance and cost.

------------------------------------------------------------------------------------

Thanks for reading!

Github repo: hseera