Our client, an international fashion retailer with over 1000 stores operating in 65 countries, decided to improve its customer experience by introducing a loyalty application on a massive scale.
In this case study, we present how we used Open Loyalty and Amazon Web Services (AWS) to deliver a future-proof technological base for this project - easily scalable and ready to handle millions of customers.
- Over 1500 concurrent API calls with response time under 1 second
- Easy scalability with Kubernetes and AWS infrastructure
The challenge - A customer loyalty system ready for massive traffic
Retail loyalty programs need to handle a huge number of customer interactions, such as points calculation, segmentation or level calculation. Scalability and performance are extremely important, especially during seasonal peaks in transactions.
"Global brands know that the performance of a loyalty program is not only about the number of customers in the database. From a technical point of view, it’s more about the number of concurrent API calls and easy scalability of the system."
CTO, Open Loyalty
For our client, a leading fashion retailer from North America, the biggest challenge for their loyalty program lay precisely in the performance of the system. To find the right solution for their business, we conducted a Proof of Concept.
The key goals for the Proof of Concept:
- The loyalty software needs to handle at least 1500 concurrent calls with a response time of around one second for the 90th percentile
- The infrastructure must be easily scalable horizontally, so the client can optimize costs and handle traffic peaks (e.g. during the holiday season or on Black Friday)
- The loyalty software must operate on AWS and make full use of its capabilities
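The response-time goal above can be checked against measured samples. The following is a minimal sketch, assuming the nearest-rank definition of a percentile and a hard limit of 1.0 second (the goal itself only says "around one second"). The function names and the sample values are hypothetical and are not part of Open Loyalty.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_goal(response_times_s, limit_s=1.0, pct=90):
    """True when the chosen percentile of the response times is within the limit."""
    return percentile(response_times_s, pct) <= limit_s


# Hypothetical response times in seconds, for illustration only.
sample = [0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95, 2.4]
p90 = percentile(sample, 90)
goal_met = meets_goal(sample)
```

A single slow outlier (2.4 s here) does not fail the check, which is the reason for stating the goal as a percentile rather than a maximum.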
The process - Confirming the loyalty software with a Proof of Concept
Knowing all the key goals, we started working on the Proof of Concept, through which we wanted to confirm that Open Loyalty is capable of meeting all the client’s requirements.
The work was conducted in iterations, which let us deliver all necessary improvements in a short time. During the first iteration, we created the infrastructure with resources estimated from our knowledge and experience. We ran some initial performance tests to see where our assumed solution stood in terms of the speed and scalability of the system.
Based on the test results, we looked at where the performance bottlenecks were and how we could speed up our software. Next, we made some improvements and ran another round of tests. This process was repeated a few times. The iterative measure-analyze-change approach allowed us to fully control the development process.
"For the purpose of testing, we generated 500k customers. Sample data was created by using built-in tools of Open Loyalty. Then, by using Apache jMeter and distributing tests for 3 different servers, we performed tests for 1500 concurrent calls. The test took one hour, with ramp-up set to half an hour. We tested three different API endpoints with exported CSV data to simulate calls for different customers."
CTO, Open Loyalty
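The load profile described in the quote can be sketched as simple arithmetic. This is an illustration only: it assumes the 1500 threads were split evenly across the 3 load-generating servers and that threads were started linearly over the ramp-up period. The class and method names are hypothetical and do not correspond to any JMeter API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPlan:
    total_threads: int
    servers: int
    ramp_up_s: int
    duration_s: int

    def threads_per_server(self):
        """Split the threads as evenly as possible across the load generators."""
        base, extra = divmod(self.total_threads, self.servers)
        return [base + (1 if i < extra else 0) for i in range(self.servers)]

    def active_threads(self, elapsed_s):
        """Threads running across all servers, assuming a linear ramp-up."""
        if elapsed_s <= 0:
            return 0
        if elapsed_s >= self.ramp_up_s:
            return self.total_threads
        return self.total_threads * elapsed_s // self.ramp_up_s


# Values taken from the text: 1500 calls, 3 servers, 30 min ramp-up, 1 h test.
plan = LoadPlan(total_threads=1500, servers=3, ramp_up_s=30 * 60, duration_s=60 * 60)
per_server = plan.threads_per_server()
halfway = plan.active_threads(15 * 60)
```

Under these assumptions each server drives 500 threads, and the full load of 1500 concurrent calls is sustained only during the second half hour of the test.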
Creating a loyalty program's infrastructure on AWS
For the purpose of this project, we created the infrastructure of this loyalty system on Amazon Web Services (AWS). To achieve better scalability (Open Loyalty is capable of running on K8s) we decided to use AWS Elastic Kubernetes Service (EKS).
Open Loyalty software is designed to work with separate data sources for reads and writes. We use PostgreSQL for writes and Elasticsearch for reads. To complement this, we chose the AWS counterparts: RDS for databases and Amazon Elasticsearch Service for data processing. We went with S3 for storage and SES for sending emails.
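The idea of separate read and write sources can be illustrated with a small in-memory sketch. It is not Open Loyalty's code: the class, its methods and the profile fields are hypothetical, plain dictionaries stand in for PostgreSQL and Elasticsearch, and the read model is updated synchronously for simplicity (how the real system propagates writes is not described here).

```python
class LoyaltyService:
    """Writes go to one store; reads are served from a separate, query-friendly view."""

    def __init__(self):
        self._write_store = {}  # stands in for PostgreSQL (source of truth)
        self._read_index = {}   # stands in for Elasticsearch (read model)

    def add_points(self, customer_id, points):
        """Record a points transaction on the write side, then refresh the read model."""
        ledger = self._write_store.setdefault(customer_id, [])
        ledger.append(points)
        self._project(customer_id)

    def _project(self, customer_id):
        """Rebuild the denormalized profile that read requests will use."""
        ledger = self._write_store[customer_id]
        self._read_index[customer_id] = {
            "customer_id": customer_id,
            "points": sum(ledger),
            "transactions": len(ledger),
        }

    def get_profile(self, customer_id):
        """Read path: never touches the write store."""
        return self._read_index.get(customer_id)


service = LoyaltyService()
service.add_points("c1", 100)
service.add_points("c1", 50)
profile = service.get_profile("c1")
```

Because reads never touch the write store, each side can be sized and scaled independently, which matters for the bottlenecks described later.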
Boosting the system performance of the loyalty program
Knowing that our software and infrastructure meet all requirements, we ran more tests against them and looked for additional areas to boost the performance of the system. It took us four measure-analyze-change iterations to get results we were truly proud of. In every iteration, we made some improvements and resolved existing bottlenecks. Here are the most significant milestones we reached.
Bottleneck 1 - Enlarging the capacity of RDS
While testing, we found out that the EC2 (Amazon Elastic Compute Cloud) instance for RDS (Relational Database Service) was way too small. The reason for this was rather surprising. The machine we chose was good enough in terms of CPU and memory, but not for handling 1500 concurrent API calls. It turned out that the instance we chose had a limit of 600 concurrent connections.
We decided to switch to a bigger database instance that supports 4 times as many concurrent connections. The final configuration of the machine was db.m5.2xlarge (8 vCPU, 32 GiB RAM). As a result, we were able to handle almost 2500 concurrent connections to the database.
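The arithmetic behind this bottleneck can be written down as a quick check. The sketch assumes, for illustration, that every concurrent API call holds exactly one database connection. The function name and that ratio are assumptions, not measurements from the project.

```python
def connection_headroom(max_connections, concurrent_calls, connections_per_call=1):
    """Connections left over (negative means the limit is exceeded)."""
    return max_connections - concurrent_calls * connections_per_call


original_limit = 600                 # limit of the first instance, from the text
upgraded_limit = 4 * original_limit  # "4 times as many" connections
target_calls = 1500                  # Proof of Concept goal

before = connection_headroom(original_limit, target_calls)
after = connection_headroom(upgraded_limit, target_calls)
```

With the original limit the target load exceeds the available connections by 900, regardless of how much CPU and memory the instance has left.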
Bottleneck 2 - Lowering the CPU usage
The second bottleneck we detected concerned Elasticsearch. The data nodes were under high CPU usage, while the CPU usage of the master nodes was only a few percent. The solution was similar to the one for RDS. We decided to switch the data nodes to higher-performance instance types. At the same time, we lowered the number of instances for the master nodes. We also increased the number of data nodes from 4 to 6. In the end, we had 3 master nodes for multi-zone availability on t2.medium.elasticsearch and 6 data nodes on r5.large.elasticsearch.
Bottleneck 3 - Solving the problem of errors produced by a huge number of queries
The next thing we detected was a tremendous number of queries to Elasticsearch. The Search Thread Pool limit of 1k search requests was exceeded by a factor of ten. This caused a lot of 504 Gateway Timeout errors. The requests simply waited too long for a response and exceeded the 60-second timeout set on the gateway. We investigated two different ways to fix this. The simplest one was to increase the number of data nodes and/or instances so we could have more search workers. With that approach, we would be able to handle more search requests.
The second way was to optimize the source code of the application to reduce the number of search queries to Elasticsearch. After discussing the pros and cons, we decided to optimize the code, which also lowered the infrastructure costs.
We reduced the number of queries to Elasticsearch by refactoring the existing code and by introducing a cache layer using Redis. For Redis, we used Amazon ElastiCache configured with 1 master and 2 replicas on cache.m5.large instances. We decided to use Redis as it was already implemented in Open Loyalty for the whole Symfony cache. For future implementations, we will consider using Varnish, which should improve performance even more.
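The effect of a cache layer on the number of search queries can be shown with a minimal cache-aside sketch. This is not the Open Loyalty implementation: a plain dictionary stands in for Redis, `fake_search` stands in for an Elasticsearch query, and all names are hypothetical.

```python
class CachedSearch:
    """Cache-aside: look in the cache first, query the backend only on a miss."""

    def __init__(self, search_fn, cache=None):
        self._search_fn = search_fn
        self._cache = {} if cache is None else cache  # stands in for Redis
        self.backend_queries = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.backend_queries += 1  # only cache misses reach the search backend
        value = self._search_fn(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying data changed."""
        self._cache.pop(key, None)


def fake_search(customer_id):
    # Hypothetical stand-in for a search query.
    return {"customer_id": customer_id, "level": "gold"}


search = CachedSearch(fake_search)
first = search.get("c1")
second = search.get("c1")
queries_after_two_reads = search.backend_queries
search.invalidate("c1")
search.get("c1")
queries_after_invalidation = search.backend_queries
```

Repeated reads of the same key reach the backend once, so the search thread pool only sees the misses. The trade-off is that cached entries must be invalidated or expired when the data changes.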
Bottleneck 4 - Solving the problem with AWS EKS and CoreDNS Pods
During one of the performance tests, we detected a problem inside AWS EKS with the CoreDNS Pods. It showed up as a lot of HTTP 500 errors (almost 10% of the requests were classified as errors). We investigated our logs and found DNS-related errors saying “could not translate host name "myserver_rds_instance.zone.rds.amazonaws.com" to address: Name or service not known”. It turned out to be a problem inside CoreDNS. By default, AWS EKS installs two instances of CoreDNS, each with a CPU request of 100m. During the tests, those Pods were under such heavy CPU usage that they couldn’t resolve hostnames. The solution was to increase the number of CoreDNS Pods to match the number of nodes in the cluster. At the same time, we increased the cache TTL to 60 seconds.
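To show why a longer cache TTL lowers the load on a resolver, here is a minimal sketch of a time-based lookup cache. It only illustrates the principle and is not how CoreDNS is implemented or configured. The class, the injected clock and the resolved address are hypothetical.

```python
import time


class TtlResolverCache:
    """Caches resolved names and re-resolves them only after the TTL has expired."""

    def __init__(self, resolve_fn, ttl_s=60.0, clock=time.monotonic):
        self._resolve_fn = resolve_fn
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # name -> (address, expiry time)
        self.upstream_lookups = 0

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh, no upstream work needed
        self.upstream_lookups += 1
        address = self._resolve_fn(name)
        self._entries[name] = (address, now + self._ttl_s)
        return address


# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TtlResolverCache(lambda name: "10.0.0.5", ttl_s=60.0, clock=lambda: fake_now[0])

cache.lookup("db.example.internal")
fake_now[0] = 30.0
cache.lookup("db.example.internal")
lookups_within_ttl = cache.upstream_lookups
fake_now[0] = 61.0
cache.lookup("db.example.internal")
lookups_after_expiry = cache.upstream_lookups
```

Within the 60-second window every repeated lookup is answered from memory, so the resolver spends CPU only once per name per TTL period.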
Analyzing our loyalty system in the final performance test
After 4 weeks of constant testing and many performance improvements, we reached the desired results. The final configuration that allowed us to achieve them consisted of 14 m5.large nodes, with the Cluster Autoscaler (CA) configured to scale up to 20 nodes once the Horizontal Pod Autoscaler (HPA) detected an average usage of 60% across all PHP Pods in the cluster.
The performance was assessed using 3 key operations: validate customer status, get customer loyalty profile and get customer transactions. The goal was to reach 1500 concurrent calls with a response time of around one second for the 90th percentile.
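The scaling decision around the 60% threshold can be sketched with the replica formula from the Kubernetes documentation for the Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric). The sketch below is simplified: it ignores the autoscaler's tolerance band and stabilization behaviour, and the replica counts and min/max bounds are hypothetical, not the project's settings.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60.0,
                     min_replicas=1, max_replicas=100):
    """Simplified HPA rule: scale replicas in proportion to observed vs. target utilization."""
    if current_replicas <= 0 or target_utilization <= 0:
        raise ValueError("replicas and target utilization must be positive")
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured bounds (hypothetical values by default).
    return max(min_replicas, min(desired, max_replicas))


scale_up = desired_replicas(10, 90.0)    # Pods busier than the 60% target
steady = desired_replicas(10, 60.0)      # exactly on target
scale_down = desired_replicas(10, 30.0)  # Pods mostly idle
capped = desired_replicas(10, 600.0, max_replicas=40)
```

When the additional Pods no longer fit on the existing nodes, a cluster autoscaler can add nodes up to its configured maximum, which is 20 in the setup described above.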
The final results of the test that confirmed Open Loyalty’s performance capabilities:
All resources provided by the AWS infrastructure were fully utilized. With Open Loyalty’s scalable architecture, along with AWS EKS, the performance can be easily improved. If the loyalty program needs to handle more traffic or provide a shorter response time, you simply need to scale the existing cluster by adding more nodes (EC2 instances) and PHP Pods.
The results - A loyalty system that handles 1500 concurrent API calls
The Proof of Concept and our series of tests confirmed that Open Loyalty set up on the AWS infrastructure handles extreme business cases. Thanks to the close cooperation between Open Loyalty’s core team and the client’s IT department, we were able to solve all key technical challenges defined for the Proof of Concept. The measure-analyze-change approach was successful and got us to the desired point pretty fast. In the end, it was all a matter of finding bottlenecks and fixing them one by one.
The key results achieved in the project were:
- Open Loyalty on AWS capable of handling 1500 concurrent users
- Easy and quick scalability of the system
- Endurance for traffic peaks, e.g. the peak driven by Black Friday
We are continuing to work on Open Loyalty’s performance to make our technology the best choice for loyalty programs on a massive scale.
If you want to know more about our approach and learnings, do not hesitate to contact us.